CristianCantoro/thes2loc

thes2loc

Italian Thesaurus terms to Library of Congress Subject Headings via Wikidata.

Why

thes2loc helps librarians build a multilingual thesaurus. In particular, it finds a mapping between the Italian Tesauro del Nuovo Soggettario (THES) of the Biblioteca Nazionale Centrale di Firenze (BNCF, the National Central Library of Florence) and the Library of Congress Subject Headings (LCSH).

The BNCF Thesaurus links to Wikipedia articles from some of its terms. For example, the term Abbazie (Abbeys) links to the Italian Wikipedia article Abbazia (Abbey). In May 2013 the Italian Wikipedia community created the template {{BNCF Thesaurus}} to link back to these terms, and also inserted the data in Wikidata, creating the property P508, i.e. BNCF Thesaurus.

For the mapping between the Library of Congress Subject Headings and English Wikipedia articles, thes2loc uses wikimap, a mapping by John Ockerbloom.

Thus a map THES <-> LCSH is built in this way:

THES <-> itwiki <-> wikidata <-> enwiki <-> LCSH
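The chain above can be read as a join of lookup tables. The following is a minimal sketch of that idea, not the project's actual implementation: the function name, the dictionary shapes, and all sample values are hypothetical placeholders. It assumes the itwiki step has already been folded into a direct THES id to Wikidata item lookup (as the BNCF Thesaurus property allows).

```python
def compose_thes_to_lcsh(thes_to_qid, qid_to_enwiki, enwiki_to_lcsh):
    """Join three lookups into (thes_id, relation, lcsh_id, wikidata_id) rows.

    All argument shapes are assumptions made for this sketch:
      thes_to_qid    : THES term id -> Wikidata item number
      qid_to_enwiki  : Wikidata item number -> enwiki article title
      enwiki_to_lcsh : enwiki article title -> list of (relation, lcsh_id)
    Terms with a missing link anywhere in the chain are skipped.
    """
    rows = []
    for thes_id, qid in thes_to_qid.items():
        title = qid_to_enwiki.get(qid)
        if title is None:
            continue  # no English Wikipedia article for this item
        for relation, lcsh_id in enwiki_to_lcsh.get(title, []):
            rows.append((thes_id, relation, lcsh_id, qid))
    return rows


# Made-up placeholder data, not real identifiers.
thes_to_qid = {"1001": 11, "1002": 22, "1003": 33}
qid_to_enwiki = {11: "Example article A", 22: "Example article B"}
enwiki_to_lcsh = {"Example article A": [("rel-x", "sh00000001")]}

rows = compose_thes_to_lcsh(thes_to_qid, qid_to_enwiki, enwiki_to_lcsh)
```

Only the first term survives the join here, because the second has no LCSH entry and the third has no enwiki title.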

Requirements

thes2loc has only been tested on Ubuntu Linux so far, but it should also work on other *nix systems. To run it you need the following software:

  • curl, this comes pre-packaged with most desktop Linux distributions.
  • jq, a powerful CLI tool for processing JSON. You can download it from the project's website.
  • GNU parallel, a shell tool for executing jobs in parallel. GNU parallel is packaged on several Linux distributions.
  • pywikibot, a Python framework to interact with MediaWiki wikis and in particular with the Wikimedia projects.
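Before running the Makefile it can be handy to check that the prerequisites are present. The sketch below is not part of thes2loc; the function name is hypothetical, and it assumes that GNU parallel is installed under the executable name parallel and that pywikibot is installed as an importable Python package.

```python
import importlib.util
import shutil

# Executables expected on PATH and Python modules expected to be importable.
REQUIRED_COMMANDS = ("curl", "jq", "parallel")
REQUIRED_MODULES = ("pywikibot",)


def missing_prerequisites(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return the names of the prerequisites that cannot be found.

    The lookup functions are parameters only so the check can be exercised
    without touching the real system.
    """
    missing = [cmd for cmd in REQUIRED_COMMANDS if which(cmd) is None]
    missing += [mod for mod in REQUIRED_MODULES if find_spec(mod) is None]
    return missing


# An empty list means everything was found on this machine.
missing = missing_prerequisites()
```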

Usage

USAGE:

make all

produces, among other files, thes2lcsh.map, which contains the THES to LCSH mapping and is the file you are interested in.

This command comprises three other commands:

  • make get: retrieves data from Wikidata (list of items with property BNCF Thesaurus) and from wikimap (LCSH -> enwiki article titles)

  • make resolve: retrieves data from Wikidata (BNCF Thesaurus item id, itwiki article title, enwiki article title)

  • make match: builds thes2lcsh.map with (BNCF Thesaurus item id, relation type, LCSH id, Wikidata item no.)

The columns in thes2lcsh.map are:

(thes_id, relation, lcsh_id, wikidata_id)

where:

  1. thes_id is the BNCF Thesaurus term identifier;

  2. relation is the relation type (as defined by John Ockerbloom's classification, see the documentation);

  3. lcsh_id is the Library of Congress Subject Heading identifier;

  4. wikidata_id is the Wikidata item no. (e.g. 42 for Q42).
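The four columns can be loaded into named records as sketched below. This is an illustration only: the README does not state the field delimiter of thes2lcsh.map, so the tab separator is an assumption, and MapRow and read_map are hypothetical names. The sample line uses made-up values.

```python
import csv
import io
from typing import NamedTuple


class MapRow(NamedTuple):
    thes_id: str       # BNCF Thesaurus term identifier
    relation: str      # relation type
    lcsh_id: str       # Library of Congress Subject Heading identifier
    wikidata_id: int   # Wikidata item number, e.g. 42 for Q42


def read_map(handle, delimiter="\t"):
    """Parse rows of thes2lcsh.map; the tab delimiter is an assumption."""
    rows = []
    for fields in csv.reader(handle, delimiter=delimiter):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        thes_id, relation, lcsh_id, wikidata_id = (f.strip() for f in fields)
        # Accept both "42" and "Q42" for the item number.
        rows.append(MapRow(thes_id, relation, lcsh_id, int(wikidata_id.lstrip("Q"))))
    return rows


# Made-up sample content, followed by a blank line that is skipped.
sample = io.StringIO("1001\trel-x\tsh00000001\t42\n\n")
parsed = read_map(sample)
```

With a real file, the same function could be called as read_map(open("thes2lcsh.map", newline="")), adjusting the delimiter if the file turns out to use a different one.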

To retrieve the corresponding URLs from thes2lcsh.map use the following mapping:

  • for BNCF Thesaurus: http://thes.bncf.firenze.sbn.it/termine.php?id={thes_id}

  • for LCSH: http://id.loc.gov/authorities/subjects/{lcsh_id}.html

  • for Wikidata: http://www.wikidata.org/wiki/Q{wikidata_id}
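The URL patterns above translate directly into string templates. A minimal sketch follows; the function name row_urls and the sample identifiers are hypothetical, while the three URL patterns are the ones listed above.

```python
URL_TEMPLATES = {
    "thes": "http://thes.bncf.firenze.sbn.it/termine.php?id={thes_id}",
    "lcsh": "http://id.loc.gov/authorities/subjects/{lcsh_id}.html",
    "wikidata": "http://www.wikidata.org/wiki/Q{wikidata_id}",
}


def row_urls(thes_id, lcsh_id, wikidata_id):
    """Build the three URLs for one row of thes2lcsh.map."""
    values = {"thes_id": thes_id, "lcsh_id": lcsh_id, "wikidata_id": wikidata_id}
    # Each template picks only the field it needs; extra keys are ignored.
    return {name: template.format(**values) for name, template in URL_TEMPLATES.items()}


# Made-up identifiers, except 42, which the README uses as its example.
urls = row_urls("1001", "sh00000001", 42)
```

Note that the Wikidata template adds the Q prefix itself, so wikidata_id is passed as the bare item number.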

Inspired by this Gist by @atomotic.

License

This software is released under the MIT license. It is free software. (c) 2014 by Cristian Consonni
