<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/notebooks/Spacy_Entity_Linker_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to use `spacy-entity-linker`, a spaCy pipeline component that links entity mentions in text to Wikidata items.

Comments:
1. It is pretty easy to use, but it seems to capture a lot of unwanted entities.
2. The entity disambiguation is noisier than I expected. For instance, the word `Friday` is disambiguated as a film.
3. AFAIK, I can't restrict disambiguation to a certain NER type. For instance, what if I want to perform entity disambiguation on PER only?


In [1]:
# spacy==3.4.4
!pip freeze | grep spacy

en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl
spacy==3.4.4
spacy-legacy==3.0.12
spacy-loggers==1.0.4


In [None]:
!pip install spacy-entity-linker

In [None]:
!python -m spacy_entity_linker "download_knowledge_base"

In [None]:
!python -m spacy download en_core_web_md
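
Since the three install cells above have no recorded output, a quick check that the packages are importable can save a confusing error later. This is only a sketch: the helper name `missing_modules` is my own, and it checks that the Python packages exist, not that the knowledge base download succeeded.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Module names as used by the install and download commands above.
required = ["spacy", "spacy_entity_linker", "en_core_web_md"]
missing = missing_modules(required)
print("missing:", missing if missing else "none")
```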

In [11]:
import spacy  # run here with spaCy 3.4.4 (see the pip freeze output above)

# initialize language model
nlp = spacy.load("en_core_web_md")

# add the entity linker component (declared through entry_points in setup.py)
nlp.add_pipe("entityLinker", last=True)

doc = nlp("I watched the Pirates of the Caribbean last silvester")

# iterates over sentences and prints linked entities
for sent in doc.sents:
    sent._.linkedEntities.pretty_print()

# OUTPUT:
# https://www.wikidata.org/wiki/Q194318     Pirates of the Caribbean        Series of fantasy adventure films                                                                   
# https://www.wikidata.org/wiki/Q12525597   Silvester                       the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)  

# entities are also directly accessible through spans
#doc[3:7]._.linkedEntities.pretty_print()
# OUTPUT:
# https://www.wikidata.org/wiki/Q194318     Pirates of the Caribbean        Series of fantasy adventure films

<EntityElement: https://www.wikidata.org/wiki/Q194318 Pirates of the Caribbean  Series of fantasy adventure films                 >
<EntityElement: https://www.wikidata.org/wiki/Q12525597 Silvester                 the day celebrated on 31 December (Roman Catholic Church) or 2 January (Eastern Orthodox Churches)>


In [21]:
news_content = """SINGAPORE: Mr Lee Hsien Yang declared on Friday (Mar 3) that he is considering running for the Elected Presidency, but lawyers said that earlier court findings that he and his wife had lied under oath in judicial proceedings could see him fail to meet the criteria of being a candidate.

This is regardless of the outcome of ongoing police investigations into the couple for potential offences of giving false evidence in the proceedings over Singapore’s founding Prime Minister Lee Kuan Yew’s will, the lawyers added.

In an interview with news outlet Bloomberg on Friday, Mr Lee Hsien Yang said that he is considering running in the Presidential Election in Singapore, which will be held later this year. President Halimah Yacob’s six-year term is due to expire in September.

Mr Lee said to Bloomberg, in reference to the ruling People’s Action Party (PAP), that there is a view that if he were to run, PAP "would be in serious trouble and could lose”, depending on who the party chooses as a candidate.

“A lot of people have come to me. They really want me to run. It’s something I would consider,” he added.

Mr Lee Hsien Yang is Prime Minister Lee Hsien Loong’s brother, and Mr Lee Kuan Yew was their father.
"""

news_content = news_content.replace("\n\n", " ")

In [32]:
doc = nlp(news_content)

for sent in doc.sents:
    print(f"{sent = }")
    sent._.linkedEntities.pretty_print()
    print("*"*100)

sent = SINGAPORE:
<EntityElement: https://www.wikidata.org/wiki/Q334 Singapore                 Southeast asia city state                         >
****************************************************************************************************
sent = Mr Lee Hsien Yang declared on Friday (Mar 3) that he is considering running for the Elected Presidency, but lawyers said that earlier court findings that he and his wife had lied under oath in judicial proceedings could see him fail to meet the criteria of being a candidate.
<EntityElement: https://www.wikidata.org/wiki/Q2904984 Lee Hsien Yang            Singaporean business executive                    >
<EntityElement: https://www.wikidata.org/wiki/Q673486 Friday                    1995 film directed by F. Gary Gray                >
<EntityElement: https://www.wikidata.org/wiki/Q110 March                     third month in the Julian and Gregorian calendars >
<EntityElement: https://www.wikidata.org/wiki/Q3558349 presidency          

In [33]:
# summary of linked entities, grouped by category
# note: every mention is listed, so repeated mentions show up as duplicates (e.g. Lee Kuan Yew)
doc._.linkedEntities.print_super_entities()

human (10) : Lee Hsien Yang,candidate,Lee Kuan Yew,Lee Hsien Yang,Halimah Yacob,Ailee,candidate,Lee Hsien Yang,Lee Hsien Loong,Lee Kuan Yew
city (3) : Singapore,Singapore,Singapore
country (3) : Singapore,Singapore,Singapore
island nation (3) : Singapore,Singapore,Singapore
city-state (3) : Singapore,Singapore,Singapore
port city (3) : Singapore,Singapore,Singapore
sovereign state (3) : Singapore,Singapore,Singapore
position (3) : candidate,candidate,brother
political party (3) : Action Party,People's Action Party,People's Action Party
film (2) : Friday,Friday
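
The duplicates in the `human` row above are repeated mentions of the same entity. A quick sketch for counting mentions per label, using only the labels copied by hand from that row (the variable names are my own, nothing here comes from the library):

```python
from collections import Counter

# Labels copied from the "human (10)" row printed above.
human_labels = [
    "Lee Hsien Yang", "candidate", "Lee Kuan Yew", "Lee Hsien Yang",
    "Halimah Yacob", "Ailee", "candidate", "Lee Hsien Yang",
    "Lee Hsien Loong", "Lee Kuan Yew",
]

# Count how many times each label was linked.
mention_counts = Counter(human_labels)
for label, count in mention_counts.most_common():
    print(f"{label}: {count}")
```

Counting this way also makes the noise easier to spot: `candidate` and `Ailee` are both listed under `human`.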


In [44]:
for linked_entity in doc._.linkedEntities:
    print(f"original label: {linked_entity.label}")
    category_labels = [category.label for category in linked_entity.get_super_entities()]
    print(f"category labels: {category_labels}")
    print("*"*100)

original label: Singapore
category labels: ['city', 'country', 'island nation', 'city-state', 'port city', 'sovereign state']
****************************************************************************************************
original label: Lee Hsien Yang
category labels: ['human']
****************************************************************************************************
original label: Friday
category labels: ['film']
****************************************************************************************************
original label: March
category labels: ['calendar month', 'month of the Gregorian calendar']
****************************************************************************************************
original label: presidency
category labels: ['executive branch']
****************************************************************************************************
original label: lawyer
category labels: ['legal professional', 'jurist']
*************************************

In [42]:
doc._.linkedEntities[1].label

'Lee Hsien Yang'
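
Comment 3 at the top asks about linking PER entities only. I am not aware of a built-in option for this, but one possible workaround is to post-filter the linked entities by their category labels, relying only on the `.label` attribute and `get_super_entities()` used in the cells above. This is a sketch, not a feature of the library: `keep_categories` and `fake_entity` are hypothetical helpers, and the stand-in objects exist only so the snippet runs without the knowledge base.

```python
from types import SimpleNamespace


def keep_categories(linked_entities, wanted):
    """Return the entities that have at least one category label in `wanted`."""
    wanted = set(wanted)
    kept = []
    for entity in linked_entities:
        labels = {category.label for category in entity.get_super_entities()}
        if labels & wanted:
            kept.append(entity)
    return kept


def fake_entity(label, categories):
    """Hypothetical stand-in exposing `.label` and `.get_super_entities()`."""
    supers = [SimpleNamespace(label=name) for name in categories]
    return SimpleNamespace(label=label, get_super_entities=lambda: supers)


# Stand-in data mirroring three of the entities printed above.
sample = [
    fake_entity("Singapore", ["city", "country"]),
    fake_entity("Lee Hsien Yang", ["human"]),
    fake_entity("Friday", ["film"]),
]
people = keep_categories(sample, {"human"})
print([entity.label for entity in people])
```

In this notebook the call would presumably be `keep_categories(doc._.linkedEntities, {"human"})`. Note that this filters after linking rather than before, and it inherits the linker's mistakes: in the summary output above, `candidate` is also categorised as `human`, so it would survive the filter.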