# kg Python package demos

---

In [17]:
from IPython.display import display, HTML

from SPARQLWrapper import SPARQLWrapper, JSON

from kg.ner.unsupervised import NounPhraseDetector, ProperNounDetector, ClusterEntityTypeDetector
from kg.ner.supervised import PyTorchTypeDetector, PretrainedEntityDetector, SpacyEntityTypeDetector
from kg.ner.utils import prepare_entity_html, prepare_entity_link_html
from kg.el.entity_db import EntityDB

### "A nor’easter grounded flights departing from New York City this weekend. LaGuardia, JFK, and Newark were all impacted though transportation officials expect normal operations to resume by early Monday morning."

In [19]:
sample_sentence = ("A nor’easter grounded flights departing from New York City this weekend. "
                   "LaGuardia, JFK, and Newark were all impacted though transportation "
                   "officials expect normal operations to resume by early Monday morning.")

---

## Entity Detection

In [20]:
# unsupervised, developed from scratch
proper_noun_detector = ProperNounDetector()
output = proper_noun_detector.extract(sample_sentence)
entity_html = prepare_entity_html(output)
display(HTML(entity_html))

In [21]:
# unsupervised, noun phrase detector
noun_phrase_detector = NounPhraseDetector()
output = noun_phrase_detector.extract(sample_sentence)
entity_html = prepare_entity_html(output)
display(HTML(entity_html))

In [4]:
# supervised, NLTK's binary entity detection
pretrained_detector = PretrainedEntityDetector()
output = pretrained_detector.extract(sample_sentence)
entity_html = prepare_entity_html(output)
display(HTML(entity_html))

---

## Named Entity Detection 

In [5]:
# supervised, PyTorch entity type detection (model loaded from a local training-run config)
config_file_path = '/Users/tmorrill002/Documents/datasets/conll/model/runs/20210810-112027/config.yaml'
pytorch_detector = PyTorchTypeDetector(config_file_path)
output = pytorch_detector.extract(sample_sentence)
entity_html = prepare_entity_html(output, binary=False)
display(HTML(entity_html))

In [6]:
# supervised, NLTK's entity detection
pretrained_detector = PretrainedEntityDetector(binary=False)
output = pretrained_detector.extract(sample_sentence)
entity_html = prepare_entity_html(output, binary=False)
display(HTML(entity_html))

In [7]:
# supervised, spaCy's entity detection
type_detector = SpacyEntityTypeDetector()
output = type_detector.extract(sample_sentence)
entity_html = prepare_entity_html(output, binary=False)
display(HTML(entity_html))

#### TODO: refactor the cluster-based procedure:
- cluster embedded noun phrases or proper nouns using K-Means (or any clustering algorithm)
- determine the optimal number of clusters $k$ through distance metrics
- have a human review a sample of phrases from each cluster and assign each cluster a semantic label (e.g. Locations)
- use this set of labels for future classifications
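
The first two steps of this TODO can be sketched in plain Python. This is only an illustration, not the implementation behind `ClusterEntityTypeDetector`: the 2-D `phrase_vectors` are made-up stand-ins for real phrase embeddings, the farthest-point seeding is a simplification chosen to keep the run deterministic, and the silhouette score is one possible distance-based criterion for picking $k$.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette score; singleton clusters contribute 0."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def choose_k(points, k_values):
    """Score every candidate k (each must be >= 2) and return the best one."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Hypothetical 2-D "embeddings" forming three well-separated groups.
phrase_vectors = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
    (0.0, 9.0), (0.2, 9.1), (0.1, 8.8),
]
best_k, scores = choose_k(phrase_vectors, range(2, 5))
print(best_k, scores)
```

With real embeddings, a library implementation with multiple random restarts would likely be a better fit. The human review step would then sample phrases per label from the output of `kmeans` for the chosen `best_k`.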

---

## Entity disambiguation (and Wikification)

**The problem:**
'New York City' is also known as:
- NYC
- New York
- the five boroughs
- Big Apple
- City of New York
- NY City
- New York, New York
- New York City, New York
- New York, NY
- New York City (NYC)

**The big idea**: take advantage of the link structure present in Wikipedia articles to capture the many ways that entities are represented.

For example, on the [Computer Science](https://en.wikipedia.org/wiki/Computer_science) Wikipedia page, you'll notice that *algorithmic processes* links to [Algorithm](https://en.wikipedia.org/wiki/Algorithm).
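
To make the idea concrete, here is a minimal sketch of how anchor texts could be collected from wiki markup and counted per link target. It is not how `EntityDB` is built (that code is not shown here); the `pages` snippets are made up for illustration, and `build_anchor_index` and `candidates` are hypothetical helper names.

```python
import re
from collections import Counter, defaultdict

# Matches [[Target]], [[Target|anchor text]] and [[Target#Section|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")


def build_anchor_index(wikitext_pages):
    """Map each lowercased anchor text to a Counter of the pages it links to."""
    index = defaultdict(Counter)
    for text in wikitext_pages:
        for target, anchor in WIKILINK.findall(text):
            # A link without explicit anchor text uses the target as its surface form.
            surface = (anchor or target).strip().lower()
            index[surface][target.strip()] += 1
    return index


def candidates(index, mention, k=2):
    """Return the k most frequently linked pages for a mention."""
    return index.get(mention.strip().lower(), Counter()).most_common(k)


# Made-up wikitext snippets, for illustration only.
pages = [
    "[[Computer science]] is the study of [[Algorithm|algorithmic processes]].",
    "The [[New York City|Big Apple]] is in [[New York (state)|New York]].",
    "Flights to [[New York City|New York]] and [[New York City|NYC]] were delayed.",
    "[[New York City|New York]] has five boroughs.",
]
anchor_index = build_anchor_index(pages)
print(candidates(anchor_index, "New York"))
```

Ranking candidates by link frequency gives a simple prior: the page that a surface form links to most often is the default guess when no other context is available.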

In [9]:
queries = ['New York', 'Mumbai', 'Shanghai']

In [10]:
db_file_path = '/Users/tmorrill002/Documents/datasets/wikipedia/20210401_sqlite/db.sqlite'
entity_db = EntityDB(db_file_path)
query_results = entity_db.query(queries, k=2)
for idx, query in enumerate(queries):
    display(HTML(entity_db.query_result_html(query, query_results[idx])))

INFO:sqlitedict:opening Sqlite table 'unnamed' in '/Users/tmorrill002/Documents/datasets/wikipedia/20210401_sqlite/db.sqlite'


---

## Combine NER with Wikification

In [13]:
# supervised, spaCy's entity detection
output = type_detector.extract(sample_sentence)
html = prepare_entity_link_html(output, entity_db)
display(HTML(html))

---

## Link to Wikidata and answer questions

**Consider the question:** How many people live in New York City?

The [New York City](https://www.wikidata.org/wiki/Q60) Wikidata entry has the following information:
- **population**: 8,398,748±10,000
    - **point in time**: 1 July 2018
    - **determination method**: project management estimation
    - **reference**:
        - **stated in**: Population Estimates Program
        - **issue**: 2018
        - **reference URL**: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=CF
        - **retrieved**: 15 December 2019

**SPARQL Query for [population of New York City](https://w.wiki/3s5m)**

```
# Population of New York City
SELECT ?count WHERE {
  wd:Q60 wdt:P1082 ?count.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

where *wd:Q60* is the entity ID for New York City, and *wdt:P1082* is the property for population.
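
The same pattern generalizes to other entities and properties. The sketch below only builds query strings and sends nothing to the endpoint; `build_property_query` and `build_dated_property_query` are hypothetical helpers, not part of the kg package. The second one goes through statement nodes (`p:`, `ps:`, `pq:`) so that the *point in time* qualifier (P585) listed above comes back alongside each value.

```python
import re


def _check_ids(entity_id, property_id):
    """Reject anything that is not a plain Wikidata Q-ID or P-ID."""
    if not re.fullmatch(r"Q\d+", entity_id):
        raise ValueError(f"not a Wikidata entity ID: {entity_id!r}")
    if not re.fullmatch(r"P\d+", property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")


def build_property_query(entity_id, property_id):
    """Query the values of one property via the direct (wdt:) predicate."""
    _check_ids(entity_id, property_id)
    return (
        f"SELECT ?value WHERE {{\n"
        f"  wd:{entity_id} wdt:{property_id} ?value.\n"
        f"}}"
    )


def build_dated_property_query(entity_id, property_id):
    """Query values together with their optional 'point in time' qualifier."""
    _check_ids(entity_id, property_id)
    return (
        f"SELECT ?value ?date WHERE {{\n"
        f"  wd:{entity_id} p:{property_id} ?statement.\n"
        f"  ?statement ps:{property_id} ?value.\n"
        f"  OPTIONAL {{ ?statement pq:P585 ?date. }}\n"
        f"}}\n"
        f"ORDER BY DESC(?date)"
    )


print(build_property_query("Q60", "P1082"))
print(build_dated_property_query("Q60", "P1082"))
```

Either string could be passed to `sparql.setQuery(...)` in the cells that follow. Note that the result variable is named `?value` here rather than `?count`.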

In [32]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

In [33]:
sparql.setQuery("""
SELECT ?count WHERE {
  wd:Q60 wdt:P1082 ?count.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

In [44]:
population = results['results']['bindings'][0]['count']['value']
print(f'The population of New York City is: {int(population):,}')

The population of New York City is: 8,398,748
