# Toponym Resolution with T-Res

This notebook demonstrates the use of the T-Res HTTP API for performing toponym resolution.

Toponym resolution refers to the task of identifying place names (toponyms) in a piece of text and linking each of them to a known physical location.

This process involves three distinct steps:
 1. **named entity recognition** to identify which spans of text are toponyms
 1. **candidate selection** to generate a list of candidate places within a knowledge base
 1. **entity linking** to determine which candidate place is the best match for the given toponym
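
As a rough illustration of how the three steps fit together, here is a toy sketch. It is not how T-Res works internally: the knowledge base, the identifiers, the prior scores and all function names are hypothetical, and each step is replaced by a deliberately naive stand-in.

```python
import re

# Hypothetical toy knowledge base: place name -> list of (identifier, prior score).
# The identifiers are made-up placeholders, not real Wikidata IDs.
KNOWLEDGE_BASE = {
    "Southport": [("place-001", 0.9), ("place-002", 0.1)],
    "Norway": [("place-003", 1.0)],
}

def recognise(text):
    """Step 1 (toy NER): a capitalised word following a locative preposition."""
    return re.findall(r"\b(?:of|off|at|in|near)\s+([A-Z][a-z]+)", text)

def select_candidates(mention):
    """Step 2 (toy candidate selection): exact lookup in the knowledge base."""
    return KNOWLEDGE_BASE.get(mention, [])

def link(candidates):
    """Step 3 (toy entity linking): pick the candidate with the highest prior."""
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate[1])[0]

def resolve(text):
    """Run the three steps in order and map each mention to its best link."""
    return {mention: link(select_candidates(mention)) for mention in recognise(text)}

resolved = resolve("The schooner Gefion, of Norway, struck off Southport.")
print(resolved)
```

A real system replaces each stand-in with a trained model, but the data flow (text, then mentions, then candidates, then one link per mention) has the same shape.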

[T-Res](https://github.com/Living-with-machines/T-Res) is a software tool that provides an end-to-end pipeline for toponym resolution, using [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page) as its knowledge base.

The T-Res HTTP API enables users to make toponym resolution queries to a remote server via an HTTP connection.

To run the examples in this notebook, a server must be available to handle the API requests. During the workshop, such a server will be provided at the host IP address given below.

Technical documentation on T-Res can be found [here](https://living-with-machines.github.io/T-Res/index.html). Developers may also find the [API docs](http://20.0.184.45:8000/v2/t-res_deezy_reldisamb-wpubl-wmtops/docs) useful.

## Setup

Let's begin by importing some Python libraries:

In [196]:
import requests
import operator
from typing import Optional
from dataclasses import dataclass
from dacite import from_dict

Next, we specify the hostname and URL for connecting to the server running the T-Res API.

In [197]:
HOST = "20.0.184.45"
API_URL = f"http://{HOST}:8000/v2/t-res_deezy_reldisamb-wpubl-wmtops"

The following helper functions will make it easy to call the T-Res API and handle the response:

In [284]:
@dataclass
class Toponym:
    mention: str
    sentence: str
    pos: int
    end_pos: int
    tag: str
    prediction: str
    cross_cand_score: dict
    latlon: Optional[list]
    wkdt_class: Optional[str]
    string_match_score: dict
    
    def __str__(self):
        toponym = self.toponym()
        s = f"Toponym:\t{toponym}"
        if self.mention != toponym:
            s += f"\nMention:\t{self.mention}"
        if self.tag != 'LOC':
            s += f"\nTag:\t\t{self.tag}"
        if self.prediction != 'NIL':
            s += f"\nWikidata link:\thttps://www.wikidata.org/wiki/{self.prediction}"
        s += f"\nCoordinates:\t{self.latlon}"
        if self.prediction in self.cross_cand_score.keys():
            s += f"\nLinking score:\t{self.cross_cand_score[self.prediction]}"
        return s

    def __repr__(self):
        return self.__str__()
    
    def toponym(self):
        if not self.string_match_score:
            return None
        # Identify the best string match.
        d = {i[0]: i[1][0] for i in self.string_match_score.items()}
        return max(d.items(), key=operator.itemgetter(1))[0]

class Toponyms:
    toponyms: list

    def __init__(self, data):
        if not isinstance(data, list):
            raise ValueError("Toponyms data must be a list.")
        self.toponyms = [from_dict(data_class=Toponym, data=t) for t in data]

    def __str__(self):
        if not self.toponyms:
            return "Empty list of toponyms."
        return '\n\n'.join([t.__str__() for t in self.toponyms])

    def __repr__(self):
        return self.__str__()

def validate_query(query):
    # Every query must carry the passage to be resolved under the `text` key.
    if "text" not in query:
        raise ValueError("T-Res API query must contain an item named `text`")

def call_api(query, parse = True):
    validate_query(query)
    response = requests.get(f'{API_URL}/toponym_resolution', json=query)
    if not parse:
        return response
    return parse_api_response(response)

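def parse_api_response(response):
    # Report failed requests and stop, rather than trying to parse an error body.
    if response.status_code != 200:
        print(f"HTTP error code: {response.status_code}")
        print(f"Reason: {response.reason}")
        return None
    result = Toponyms(response.json())
    # Unwrap a single result so that it displays directly.
    if len(result.toponyms) == 1:
        return result.toponyms[0]
    return result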
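
The least obvious part of the helpers above is the `toponym()` method, which picks the best string match. The standalone sketch below reproduces that selection on made-up data. The function name `best_string_match`, the candidate names, the scores and the placeholder IDs are all hypothetical; the only assumption taken from the helper code is that the first element of each value is the match score.

```python
# Hypothetical `string_match_score` value. As in the `toponym()` method, the
# first element of each value is taken to be the string match score; the
# second element here is only a placeholder.
string_match_score = {
    "Sheffield": [0.92, ["Q-example-1"]],
    "Shefford": [0.61, ["Q-example-2"]],
}

def best_string_match(scores):
    """Return the candidate name with the highest match score, or None."""
    if not scores:
        return None
    return max(scores, key=lambda name: scores[name][0])

best = best_string_match(string_match_score)
print(best)
```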

## Toponym resolution examples

We're now ready to query the API by sending chunks of input text. The following examples use snippets of newspaper articles drawn from the "Living with Machines" digitised collection.

### Simple toponym resolution from text

A simple toponym resolution query involves submitting a passage of plain text to the T-Res API.

The response (after parsing) includes the ID of the best match in the Wikidata knowledge base, its latitude-longitude coordinates and the linking score (between 0 and 1), which can be interpreted as a measure of confidence in the accuracy of the match.

Here are two examples:

In [None]:
# Source: The Herald of Wales and Monmouthshire Recorder, 1884-05-24.
query = {"text": "He was sure there was no man in the profession in Swansea who could be more honoured end trusted and more highly respected by his professional brethren than Mr. Davies, being in every way such as a solicitor ought to be."}
call_api(query)

In [None]:
# Source: The Dewsbury Chronicle, and West Riding Advertiser, 1882-04-29.
query = {"text": "It was also expected that Mr. C. Beckett Denison and other prominent members of the Conservative party would take part in the demonstration, and an effort will be made to obtain the presence of Viscount Cranbrooke, Mr. J. Lowther, M.P , and the Hon. G. C. Dawnay, M.P., for the North Riding."}
call_api(query)
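
Because the linking score can be read as a confidence measure, one option is to keep only the links that score above a chosen threshold. The sketch below works on plain dictionaries with the same field names as the API response (`mention`, `prediction`, `cross_cand_score`). The function name `confident_links`, the threshold of 0.5 and all sample values are hypothetical.

```python
def confident_links(toponyms, threshold=0.5):
    """Keep (mention, prediction, score) for links scoring at least `threshold`."""
    kept = []
    for toponym in toponyms:
        # An unlinked toponym ('NIL') has no score for its prediction.
        score = toponym["cross_cand_score"].get(toponym["prediction"])
        if score is not None and score >= threshold:
            kept.append((toponym["mention"], toponym["prediction"], score))
    return kept

# Hypothetical records with placeholder IDs and made-up scores.
sample = [
    {"mention": "Swansea", "prediction": "Q-example-1",
     "cross_cand_score": {"Q-example-1": 0.85}},
    {"mention": "Newport", "prediction": "Q-example-2",
     "cross_cand_score": {"Q-example-2": 0.30, "Q-example-3": 0.25}},
    {"mention": "Prospect Mill", "prediction": "NIL", "cross_cand_score": {}},
]

kept = confident_links(sample)
print(kept)
```

A suitable threshold depends on the collection and on how costly a wrong link is, so it is best chosen by inspecting a sample of results.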

### Example with multiple toponyms

If there is more than one toponym in the text, T-Res returns the results in a list, as in this example:

In [None]:
# Source: The Blackpool Herald, 1891-08-28.
query = {"text": "On the same morning Captain Seed and the men of the steamer Bickerstaffe,. succeeded in saving the crew of the schooner Gefion, of Norway, which struck on the Spencer Bank, off Southport."}
call_api(query)

Note that the sentence contains several other proper nouns (Captain Seed, Bickerstaffe, Gefion) that are not toponyms, which T-Res correctly ignores.

Another proper noun, "Spencer Bank", could be considered an edge case. It refers to an elevation of the sea floor off the coast of Southport. T-Res does not identify this as a toponym.

The named entity "Bickerstaffe" is interesting, in that it refers to a ship in the above example but is also the name of a nearby village in Lancashire. T-Res is able to distinguish between these two cases from the context. Indeed, if we slightly modify the sentence the same named entity is identified as a toponym:

In [None]:
# Source: The Blackpool Herald, 1891-08-28. *Modified*
query = {"text": "On the same morning Captain Seed and the men on their way to Bickerstaffe succeeded in saving the crew of the schooner Gefion, of Norway, which struck on the Spencer Bank, off Southport."}
call_api(query)

### Example with place of publication information

If the place of publication is known, this information can be included in the query. This additional metadata is used by T-Res to improve the quality of the entity linking step.

In this example the text is identical to an earlier query, but this time we also include the name of the town (Blackpool) in which the newspaper was published *and* its Wikidata ID.

We see that the confidence of the link is significantly improved in the case of Southport (a town close to Blackpool), while for Norway the linking score is approximately unchanged.

In [None]:
# Source: The Blackpool Herald, 1891-08-28.
query = {
    "text": "On the same morning Captain Seed and the men of the steamer Bickerstaffe,. succeeded in saving the crew of the schooner Gefion, of Norway, which struck on the Spencer Bank, off Southport.",
    "place": "Blackpool, Lancashire, England",
    "place_wqid": "Q170377"
    }
call_api(query)
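
When many queries are built programmatically, a small helper can keep the optional metadata tidy. The sketch below assembles a query dictionary with the same keys used above (`text`, `place`, `place_wqid`). The name `build_query` is hypothetical and not part of T-Res, and the choice to omit unknown fields simply mirrors the earlier examples that send `text` alone.

```python
from typing import Optional

def build_query(text: str,
                place: Optional[str] = None,
                place_wqid: Optional[str] = None) -> dict:
    """Assemble a query dictionary, leaving out metadata that is unknown."""
    if not text:
        raise ValueError("Query text must not be empty.")
    query = {"text": text}
    if place is not None:
        query["place"] = place
    if place_wqid is not None:
        query["place_wqid"] = place_wqid
    return query

plain = build_query("The schooner struck on the Spencer Bank, off Southport.")
with_place = build_query(
    "The schooner struck on the Spencer Bank, off Southport.",
    place="Blackpool, Lancashire, England",
    place_wqid="Q170377",
)
print(with_place)
```

The resulting dictionary can be passed to `call_api` in the same way as the hand-written queries.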

### Example of successful linking despite OCR error

T-Res uses fuzzy string matching to identify named entities, so it may be able to find the correct link even when the digitisation process has introduced a misspelled toponym.

In [None]:
query = {"text": "A remarkable case of rattening has just occurred in the building trade at Shefiield."}
call_api(query)
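
To give a feel for fuzzy matching in general, the sketch below uses Python's standard `difflib` module as a stand-in. This is not the matcher used by T-Res; the short list of place names and the function name `closest_name` are hypothetical and chosen only for illustration.

```python
import difflib

# Hypothetical mini gazetteer, for illustration only.
gazetteer = ["Sheffield", "Shefford", "Chesterfield", "Huddersfield"]

def closest_name(mention, names, cutoff=0.6):
    """Return the name most similar to `mention`, or None if none is close enough."""
    matches = difflib.get_close_matches(mention, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

best = closest_name("Shefiield", gazetteer)
print(best)
```

Character-level similarity of this kind tolerates single-letter OCR slips, but it knows nothing about context, which is why the subsequent linking step is still needed.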

### Example of a toponym of type 'BUILDING'

Toponyms in T-Res are labelled with a "Tag" property, referring to the type of the named entity.

The most common tag is "LOC" (for location) but if the best match is a specific building, as in the following example, this will be reflected in the "Tag":

In [None]:
query = {"text": "A large crowd gathered, and plenty of volunteers aided in the work of rescue, whilst ambulances and stretchers were fetched from the Middlesex Hospital."}
call_api(query)

### Example of a place linking error

In this example, three toponyms are identified but not all of them are successfully linked to places in the knowledge base.

In [None]:
# Source: The Stourbridge Observer, 1882-12-09.
query = {
    "text": "Early on Monday morning a fire was discovered ois the premises of Mr. .Toseph Boyle, woollen manufacturer, Prospect Mill, at Longwood, near Huddersfield, and before the flames could be extinguished they had done damage to the extent of about £16,000 or £17,000.",
    "place": "Stourbridge, West Midlands, England",
    "place_wqid": "Q661707"
    }
result = call_api(query)
result

The first named entity is "Prospect Mill", which is correctly tagged as a BUILDING, but no match for it is found in Wikidata.

The third entity is "Huddersfield", which is correctly linked to Wikidata ID [Q201812](https://www.wikidata.org/wiki/Q201812).

The second entity, "Longwood", is an example of a place linking error. It is linked (erroneously) to Wikidata ID [Q6674497](https://www.wikidata.org/wiki/Q6674497), which is a neighbourhood in New York City. It is clear from the text that the correct match for Longwood is [Q6674506](https://www.wikidata.org/wiki/Q6674506), a suburb of Huddersfield, West Yorkshire.

The reason for this error is that Longwood, New York, is a more prominent entry in Wikidata (specifically, there are more links to that entry than to other places with a similar name) and this information is used by T-Res when seeking the most probable link. We can look at the scores assigned by T-Res to each candidate place, by printing the `cross_cand_score` attribute for that toponym:

In [None]:
result.toponyms[1].cross_cand_score

Here we see that the correct link (Q6674506) scored 0.19, while the incorrect one (Q6674497) scored 0.32.
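
To compare candidates at a glance, the scores can be sorted from most to least probable. The sketch below uses only the two rounded scores quoted above; the actual `cross_cand_score` dictionary may contain further candidates, and the function name `rank_candidates` is hypothetical.

```python
# The two scores quoted in the text, rounded to two decimal places.
cross_cand_score = {"Q6674506": 0.19, "Q6674497": 0.32}

def rank_candidates(scores):
    """Return (Wikidata ID, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rank_candidates(cross_cand_score)
for wqid, score in ranked:
    print(f"https://www.wikidata.org/wiki/{wqid}\t{score}")
```

Listing the runners-up in this way makes it easier to spot cases where the correct place was among the candidates but was outscored.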

This example demonstrates that T-Res, like any automated (or indeed manual) system for place linking, is not perfectly accurate. Improvements to the place linking algorithm are part of the current development effort on T-Res.

## Your turn

Using the examples above, try out the T-Res API with your own text inputs. These could come from the open newspapers collection made available earlier in the workshop, or from any other source you like!