Reconciliation

Thad Guidry edited this page Sep 12, 2018 · 40 revisions

This page is also available in Japanese.


NOTE: OpenRefine Reconciliation does NOT work against a SPARQL endpoint at this time, but instead only a Reconciliation Service API.


Reconciliation is a semi-automated process of matching text names to database IDs (keys). This is semi-automated because in some cases, machine alone is not sufficient and human judgement is essential. For example, given "Ocean's Eleven" as the name of a film, should it be matched to

While we talk about "databases" here, perhaps a more accurate term would be "name registries". And conceptually, "reconciliation" isn't a new concept. In some fields, it's being done all the time. For example, when the police gets a name of a suspect, they run the name through their databases of known felons. More details about the suspect (e.g., race, height, last known address) would help the police narrow down the suspect among several database matches.

You can use OpenRefine to perform reconciliation of names in your data against any database that exposes a web service following this Reconciliation Service API specification. One such database is Wikidata, and in this document we will use it as our example service.

Basics

Given a column containing names, we can reconcile those names against Wikidata. To do so, invoke the column's drop-down menu and pick Reconcile > Start reconciling. If you want to reconcile only some cells in that column, then use filters and facets to isolate them first.

Screenshot of the reconciliation menu

In the Reconcile dialog box, pick Wikidata Reconciliation Service on the left. That service comes by default, but you can install more services (see below to get a version of this service in your language).

There are a few options to configure the reconciliation process, which will help you get more reliable matches. These options are covered in the next sections. To start the reconciliation process, click the Start Reconciling button.

Screenshot of the reconciliation dialog

This process will take a while depending on how much data you have. Currently, the Wikidata reconciliation service processes about 3 rows per second.

When the process is done, you will see that the reconciliation data in the cells.

  • Either the cell was successfully matched: it displays a single dark blue link. In this case, the reconciliation is confident that the match is correct: you should not have to check it manually.

(Screenshot of a matched item)

  • Or a few candidates are displayed, together with their reconciliation score, with light blue links. You need to pick manually the correct one. For each matching decision you make, you have two options: either match this cell only (button to perform a single match), or also use the same identifier for all other cells containing the same unreconciled value (button to perform a multiple match).

(Screenshot of an unmatched item)

In addition to changing the ways the reconciled cells are displayed, OpenRefine also automatically creates two facets for you to use to filter the cells based on the reconciliation data. One is a numeric facet for the reconciliation scores of the best candidates of the cells. Higher scores mean better matches. You could filter for the higher scores, and approve them all in bulk (invoking that column's drop-down menu and using the Reconcile > Actions submenu).

There is also the judgment facet, which lets you filter for the cells that haven't been matched (pick "None" in the facet). As you process each cell, its judgment changes from "None" to "Matched" and it disappears from the view, because it no longer fits the facet's selection.

Restricting matches by type

One simple way to improve the quality of the matches is to restrict them to a given type. If your dataset contains universities, then you know that "Heidelberg" does not refer to the city of Heidelberg (Q2966) but rather the university of Heidelberg (Q151510).

In Wikidata, types are items themselves. For instance, the university of Ljubljana (Q1377) has type public university (Q875538), which is witnessed by a statement on Q1377 using the instance of (P31) property.

Types are organized into an ontology, by specifying which types are subclasses of the others, using the subclass of (P279) property. For instance, public university (Q875538) is a subclass of university (Q3918).

(diagram explaining types in Wikidata)

One useful tool to visualize the subclasses of a given type is the Wikidata Graph Builder. For instance, here are all the subclasses of academic institution (Q4671277).

The reconciliation dialog allows you to select a type for your rows. This will restrict the matches to items which are instances of any subclass of the given type. For instance, if you select public institution (Q875538), then university of Ljubljana (Q1377) will be a possible match, because of the path depicted above.

By default, OpenRefine will propose you some types based on the first few items of your dataset. It is quite likely that these types will be too specific for your needs. You can specify a broader type that will encompass more items. For instance, if you have a dataset of universities, it might be better to pick educational institution (Q2385804) rather than university (Q3918), for instance if some entry happens to be marked as a college (Q189004) in Wikidata.

Some items are not marked as the instances of anything, because no one has taken the time to add these statements yet. If you restrict the reconciliation to a type, these items will not appear in the results, except when no items of the given type could be found for the particular name. In this case, items with no type can be returned: they will have a low score and will not be matched automatically.

Refining by property

Names are ambiguous, and sometimes your dataset contains other columns that can help identify which item is being referred to. The reconciliation interface can be configured to take these columns into account in the matching scores. For instance, say you have a dataset of sportspeople with a column indicating which sport they play.

(example dataset)

If you want to use the second column in the reconciliation process, you need to identify how this additional information is represented in Wikidata. So you need to find out which Wikidata property is used to link sportspeople to the sport they play. Looking at Usain Bolt (Q1189) it seems that sport (P641) can be used for that.

In the reconciliation dialog, tick the box to include your column and bind it to the relevant Wikidata property.

(dialog to bind a column to a property)

The reconciliation service will then fuzzy-match on both columns.

(example dataset)

As you can see, the score is boosted when the sport matches. In the first row, there is a slight mismatch between the sport in the table ("rugby") and on Wikidata ("rugby league"), which explains why the first match does not reach the perfect score (100) and is therefore not matched automatically.

Advanced usage: property paths

This feature is specific to the reconciliation interface for Wikidata.

Sometimes, the relation between the reconciled item and the disambiguating column is not direct: it is not represented as a property itself. Let us consider this dataset of cities:

(an example dataset with cities and country codes)

To fetch the country code from an item representing a city, you need to follow two properties. First, follow country (P17) to get to the item for the country in which this city is located, then follow ISO 3166-1 alpha-2 code (P297) to get the two-letter code string.

(an example dataset with cities and country codes)

This is supported by the reconciliation interface, with a syntax inspired by SPARQL property paths: just type the sequence of property identifiers separated by slashes, such as P17/P297:

(inputting a property path in the reconciliation interface)

This additional information allows to distinguish between namesakes, to some extent. As "Cambridge, US" is still ambiguous, there are multiple items with a perfect matching score, but "Oxford, GB" successfully disambiguates one particular city from its namesakes. (the disambiguated dataset)

The endpoint currently supports two property combinators: /, to concatenate two paths as above, and |, to compute the union of the values yielded by two paths. Concatenation / has precedence over disjunction |. The dot character . can be used to denote the empty path. For instance, the following property paths are equivalent:

  • P17|P749/P17
  • P17|(P749/P17)
  • (.|P749)/P17

They fetch the country (P17) of an item or that of its parent organization (P749).

Labels, Descriptions, Aliases

Terms can be accessed as follows:

(L for label , D for description, A for aliases)

  • Len for English label
  • Dfi for Finnish description
  • Apt for Portuguese aliases

The lowercase letters are Wikimedia language codes which select in which language the terms will be fetched. No language fall-back is performed when retrieving the values.

Comparing values

By default, the values supplied in OpenRefine and the ones present in Wikidata are compared by string fuzzy-matching. There are some exceptions to this:

  • If the value is an identifier, then exact string matching is used.
  • If the values are integers, exact equality between integers is used.
  • If the values are floating point numbers, the score is 100 if they are equal and decreases towards 0 as their absolute difference increases.
  • If the values are coordinates (specified in the "lat,lng" format on OpenRefine's side), then the matching score is 100 when they are equal and decreases as their distance increases. Currently a score of 0 is reached when the points are 1km away from each other.

Sometimes, we need a more specific matching on sub-parts of these values. It is possible to select these parts for matching by appending a modifier at the end of the property path:

  • @lat and @lng: latitude and longitude of geographical coordinates (float)
  • @year, @month, @day, @hour, @minute and @second: parts of a time value (int). They are returned only if the precision of the Wikidata value is good enough to define them.
  • @isodate: returns a date in the ISO format 1987-08-23 (string). A value is always returned.
  • @iso: returns the date and time in the ISO format 1996-03-17T04:15:00+00:00. A value is always returned. For times and dates, all values are returned in the UTC time zone.
  • @urlscheme ("https"), @netloc ("www.wikidata.org") and @urlpath ("/wiki/Q42") can be used to perform exact matching on parts of URLs.

For instance, if you want to refine people by their birth dates, but you only have the month and day. First, split the birthday dates in two columns, for month and day.

(the initial dataset)

Reconcile using the following parameters (P569 can be found with the autocompletion first):

(the reconciliation parameters: P569@month and P569@day)

Reconciling via unique identifiers

Many Wikidata items come with external unique identifiers, so if you have some in your dataset they can be very useful to get a precise reconciliation. When a Wikidata property corresponding to a unique identifier is used, the reconciliation interface uses a different matching strategy. It first retrieves any items with this identifier and gives it the perfect matching score (100). All the other properties are ignored, including the original name in the column being reconciled. If no item bears the supplied unique identifier, the reconciliation interface falls back on the standard strategy, using string search and fuzzy matching.

Note: Sometimes the same external identifier is assigned to different Wikidata items (in which case the conflict will be flagged in Wikidata as a uniqueness constraint violation). In this case, the multiple items are returned but are not automatically matched.

Note: It is important to supply the unique identifiers in the same format as they are stored on Wikidata. Matching of identifiers is done with strict string equality. Case differences, spaces or leading zeros all matter. Check the documentation of the relevant Wikidata property for the expected format.

This alternative matching strategy is only used for unique identifiers belonging to the reconciled items. For instance, while the ISO 3166-1 alpha-2 code (P297) is a unique identifier for countries, it will not be used as a unique identifier for cities reconciled with the SPARQL property path P17/P297 demonstrated above.

However, disjunctions of unique identifiers can be still used: for instance, suppose you have a column containing sometimes a DOI (P356), and sometimes a CiteSeerX ID (P3784). If you are confident that such ids are not ambiguous (which is the case in this example as the formats are incompatible), then you can reconcile with this column mapped to (P356|P3784). For each row, any item with the given value as either a DOI or a CiteSeerX id will be matched, ignoring the other values.

Fetching values from the reconciled database

One reason to reconcile a dataset to some database is that it allows you to pull data from the database into OpenRefine. There are essentially two ways to do this.

  • If the reconciliation service supports the Data Extension API, then you can use the "Add columns from reconciled values" action to augment your OpenRefine project with new columns: Screencast of the data extension workflow
  • If the reconciliation service does not support this feature, look out for a generic web API for that data source. You can use the "Add column by fetching URLs" operation to call this API with the ids obtained from the reconciliation process. For instance, if you have used the OpenCorporates service, you can use their API to fetch JSON-formatted data about the reconciled companies. The GREL expression "https://api.opencorporates.com/companies/"+cell.recon.match.id will be translated to URLs like https://api.opencorporates.com/companies/17087985, whose content can then be parsed in OpenRefine.

Keep all the suggestions made

To keep all suggestion made by the reconciliation service and not match against the best one:

  1. Before starting reconciliation un-check option Auto-match candidates with high confidence in Reconcile dialog (it's at the bottom).
    1. After reconciliation you should be able to see all three suggested reconciliation candidates. This step is optional - it is just for you to see what will be extracted.
  2. After reconciliation click on the reconciled column and choose Add column based on this column.
    1. Put following expression in the Expression field if you want the value: cell.recon.candidates[0].name
    2. if you want a link to Wikidata use this expression: cell.recon.candidates[0].id
    3. To get all of the IDs at the same time use the expression: cell.recon.candidates.join(',') , which will give you a comma separated list of all the IDs. Then use the command "Split multi-valued cells" to split the identifiers into individual rows.

Language settings for Wikidata reconciliation

You can install a version of the Wikidata reconciliation service that uses your language.

First, you need to find your language code: this is the first letters in the domain name of your Wikipedia (for instance "fr" if your Wikipedia is https://fr.wikipedia.org/wiki/).

Then, open the reconciliation dialog and click Add Standard Service. The URL is https://tools.wmflabs.org/openrefine-wikidata/fr/api where "fr" is replaced by your language code.

When reconciling using this interface, items and properties will be displayed in your language if a translation is available. The matching score of the reconciliation is not influenced by your choice of language: items are matched by considering all labels and keeping the best possible match. So the language of your dataset is irrelevant to the choice of the language for the reconciliation interface.

Additional Resources

See also:

Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.