# Chapter 05

## Metadata

In the previous chapter I consciously steered the discussion away from information models to data models. Whereas Information Retrieval specifically deals with retrieving information (documents, multimedia, ...), Information Science focusses more on data or, better said, **metadata**.

[Wikipedia](https://en.wikipedia.org/wiki/Metadata) says:

> Metadata is "data that provides information about other data". In other words, it is "data about data." Many distinct types of metadata exist, including descriptive metadata, structural metadata, (...)

Descriptive metadata, for instance, is descriptive information about a resource, used for discovery and identification. For a book, descriptive metadata would include elements such as title, author and date of publication. Another example is Exif (Exchangeable image file format), a metadata standard that specifies the formats for images, sound, and ancillary tags used by digital cameras (including smartphones), scanners and other systems handling image and sound files recorded by digital cameras.

## It's complicated

However, this definition of metadata as "data about data" somewhat misses the mark. In order to illustrate that, we need to take a look at a concrete example of metadata.

Have a look at this title page:

![Justus Lipsius's Poliorcetica](Lipsius.jpg)


The STCV catalogue, which we have already mentioned a few times, catalogues this book as follows ([permalink](https://anet.be/record/stcvopac/c:stcv:7081813/E)):

Category | Metadata
--- | ---
Title |	Title page: Poliorceticωn sive De machinis. Tormentis. Telis. Libri qvinqve
Author | Title page: Ivstus Lipsius \[Lipsius, Justus]
. | External: van Veen, Otto \[Illustrator]
. | External: van der Borcht, Pieter I \[Illustrator]
Publication	| Title page: ex officina Plantiniana, apud viduam, & Ioannem Moretum \[Rivière, Jeanne & Jan I Moretus]
. | Title page: Antverpiæ \[Antwerp]
. | 1596
Language | Latin \[Target language]

Now let's compare this to the title page.

- For instance, we see that the difference between lower case, upper case and small caps has disappeared. This might seem trivial, but did you realise that in Renaissance Latin a capitalized final "I" stands for "ii"? So the full title is actually, "Ivsti Lipsii..." Or, suppose that someone is studying the use of capitalization in title layout as a marketing means. This is vital information that is missing.

- Moreover, when we focus on the title, we see small changes. Whereas the title page does not have a space between "Tormentis.Telis", the transcription does. And it also leaves out the italic parts (another piece of layout info missing!) "Ad Historiarum lucem" (In light of History) and "Cum Privilegiis Caesareo & Regio" (With Imperial and Royal Privilege). Again, such information could be very interesting to book historians.

- Also, when we look at the date, we notice this has been interpreted rather than transcribed. The Roman numeral (with the typical dots) has been silently turned into '1596'.

- Furthermore, we see that copy-specific information, such as the stamp in the upper right corner and the pasted-on inscription at the bottom, has also been left out.

- On the other hand, there is also more information in the descriptive metadata than is on the title page. The names of the illustrators, for instance, and the name of Christopher Plantin's widow are added.
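To make the point about the interpreted date concrete, here is a minimal Python sketch of what a cataloguer (or a script) silently does when turning a dotted Roman numeral into '1596'. The function name `imprint_year` is my own, and the input spelling `M.D. XCVI` is an assumed transcription of the numeral (the title page itself remains the authority). The sketch only handles plain Roman letters, not early modern forms such as the apostrophus (CIƆ), and it expects the numeral on its own, without surrounding words.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def imprint_year(roman: str) -> int:
    """Convert a Roman-numeral imprint date such as 'M.D. XCVI' to an integer."""
    # Drop the dots and spaces typical of title-page dates. This is exactly
    # the layout information that the interpreted year no longer carries.
    letters = re.sub(r"[^IVXLCDM]", "", roman.upper())
    if not letters:
        raise ValueError(f"no Roman numeral found in {roman!r}")
    total = 0
    # Pair every letter with the one that follows it (a space marks the end).
    for current, following in zip(letters, letters[1:] + " "):
        value = ROMAN_VALUES[current]
        # Subtractive notation: a smaller value before a larger one (XC = 90).
        if following != " " and value < ROMAN_VALUES[following]:
            total -= value
        else:
            total += value
    return total


print(imprint_year("M.D. XCVI"))  # 1596
```

Note that the conversion only works in one direction: from `1596` alone you cannot recover the dots, the spacing or the capitalization of the original.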

Let's compare this with the entry in [Worldcat](http://www.worldcat.org/oclc/79260741):

Category | Metadata
--- | ---
Title | Ivsti LipsI Poliorcetic\[o]n, sive, De machinis, tormentis, telis, libri qvinqve : ad historiarum lucem.
Author | Justus Lipsius; Petrus van der Borcht
Publisher |	Antverpiæ : Ex Officina Plantiniana, apud Viduam, & Ioannem Moretum, M.D. XCVI \[1596]

The information is broadly similar, but now that we are attuned to some of the subtleties, we notice the differences. "Ad historiarum lucem" is present, for instance. On the other hand, Worldcat uses different capitalization and punctuation and lists only one illustrator, without specifying that this is external information (i.e. not found on the title page).

This example shows that different catalogues adhere to different cataloguing rules. It is impossible to simply catalogue a book title (or indeed any item, be it physical or digital) "as is". However diplomatic and inclusive you try to be when cataloguing, you will always have to make hard decisions about how to handle layout, how to transcribe characters, whether or not to standardize spelling, punctuation, etcetera.

All of this is perfectly understandable and in general (good) catalogues will be explicit and very scrupulous in the cataloguing rules they follow. The danger is that when we leave the cataloguing context and, for instance, acquire catalogue information in a data dump (STCV is freely available [here](https://www.uantwerpen.be/nl/projecten/anet/open-data/)) we tend to forget this and take the metadata at face value.

Imagine, for a minute, that you hadn't seen the above title page, but merely got the STCV metadata from a SQL query. How accurate would your understanding of this title page actually be? And what happens when, as good DH research is bound to do, you break open metadata containers and aggregate metadata, for instance merging several of the national "short-title catalogue" initiatives (STCV, STCN, ESTC, USTC, ...), which all adhere to different rules?
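As a rough illustration of what aggregation involves, the Python sketch below tries to match the STCV and Worldcat titles quoted above. The normalisation steps in `normalise` (lower-casing, mapping ω to o and v/j to u/i, dropping brackets and punctuation) are my own assumptions for this example, not rules taken from any of the catalogues.

```python
import re

STCV_TITLE = "Poliorceticωn sive De machinis. Tormentis. Telis. Libri qvinqve"
WORLDCAT_TITLE = (
    "Ivsti LipsI Poliorcetic[o]n, sive, De machinis, tormentis, "
    "telis, libri qvinqve : ad historiarum lucem."
)


def normalise(title: str) -> list[str]:
    """Reduce a transcribed title to a list of comparable word tokens."""
    text = title.casefold()                       # capitalization is lost here
    text = text.replace("ω", "o")                 # Greek omega on the title page
    text = re.sub(r"[\[\]]", "", text)            # editorial brackets: [o] -> o
    text = text.replace("v", "u").replace("j", "i")  # u/v and i/j spelling
    text = re.sub(r"[^\w\s]", " ", text)          # punctuation is lost here
    return text.split()


stcv = normalise(STCV_TITLE)
worldcat = normalise(WORLDCAT_TITLE)

# The raw strings do not match, so a naive join on title finds nothing.
print(STCV_TITLE == WORLDCAT_TITLE)
# After normalisation every STCV word also occurs in the Worldcat title.
print(set(stcv) <= set(worldcat))
# What Worldcat transcribes and STCV leaves out.
print(sorted(set(worldcat) - set(stcv)))
```

The records only line up after we throw away precisely the distinctions (capitalization, punctuation, spelling) that we identified above as potentially meaningful. Every such normalisation step is itself a cataloguing decision.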

To make matters worse, our example was a very simple one really. There are many, many more complex metadata problems. Just to give you a taste:

- How would you catalogue one of those toddler squeaky books that feature not a single word of text?
- When in 1993 Prince changed his stage name to the unpronounceable symbol ![Prince](prince.png) (known to fans as the "Love Symbol"), and was sometimes referred to as the Artist Formerly Known as Prince or simply the Artist, how were record shops supposed to catalogue his albums? Remember, in those days, most people would go up to the "P" section and browse for "Prince"!
- Or what about the IMDb website listing the actors of the Blair Witch Project as "missing, presumed dead" in the first year of the film's availability (see [this](https://web.archive.org/web/20170109185339/http://www.telegraph.co.uk/films/2016/07/25/why-did-the-world-think-the-blair-witch-project-really-happened/) article)?


## Metadata 101

We now have a better understanding of metadata. As a former student of mine [@Karolingva]() once tweeted from a [RightsCon](https://www.rightscon.org/) conference: metadata is not data about data, but

- created data about data
- by humans
- with a purpose
- according to certain standards

This means that when working with metadata we need to be acutely aware of this context.

In short, that means having an understanding of metadata rules, standards and exports.

### Rules

- General descriptive book cataloging: RDA (Resource Description and Access), http://access.rdatoolkit.org/ (subscription needed)
  - E.g. names with articles = ‘The Hague’ NOT ‘Hague, the’
- Special collections: DCRM(B) (Descriptive Cataloging of Rare Materials - Books), http://rbms.info/dcrm/dcrmb/ (open source)
  - E.g. no spaces for abbreviations = ‘Ad S.R.E. Cardinalem…’ EXCEPT multiple letter-abbreviations = ‘Ad Ph. D. Jacobum…’


### Standards

- Books: MARC, Machine-Readable Cataloging (MARC21), https://www.loc.gov/marc/
- Archives: EAD, Encoded Archival Description (XML standard), http://www.loc.gov/ead/
- Objects: Dublin Core, Dublin Core Metadata Initiative, http://www.dublincore.org/
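To give a feel for what such a standard looks like in practice, here is a minimal Python sketch that expresses the Lipsius record from earlier in this chapter with Dublin Core elements. The namespace URI is the standard one for the Dublin Core element set. Everything else is an assumption made for illustration: the wrapper element `record` is hypothetical, and the mapping of STCV categories to `dc:creator`, `dc:contributor` and so on is my own choice, not an official crosswalk.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# (element, value) pairs; a list rather than a dict because Dublin Core
# elements are repeatable (two illustrators, hence two contributors).
record = [
    ("title", "Poliorceticωn sive De machinis. Tormentis. Telis. Libri qvinqve"),
    ("creator", "Lipsius, Justus"),
    ("contributor", "van Veen, Otto"),
    ("contributor", "van der Borcht, Pieter I"),
    ("publisher", "Rivière, Jeanne & Jan I Moretus"),
    ("date", "1596"),
    ("language", "Latin"),
]

root = ET.Element("record")  # hypothetical wrapper element
for element, value in record:
    ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Notice how much the standard flattens: the distinction between "Title page" and "External" information, and the role of "Illustrator", have nowhere to go in this simple form.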

### Exports

txt, csv, json, xml, ...
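The export format is not neutral either. The sketch below, using only Python's standard library, writes one and the same record to JSON and to CSV. The field names (`title`, `illustrators`, `year`, ...) and the `"; "` separator are hypothetical choices for this example, not the layout of any actual catalogue export.

```python
import csv
import io
import json

record = {
    "title": "Poliorceticωn sive De machinis. Tormentis. Telis. Libri qvinqve",
    "author": "Lipsius, Justus",
    "illustrators": ["van Veen, Otto", "van der Borcht, Pieter I"],
    "place": "Antwerp",
    "year": 1596,
    "language": "Latin",
}

# JSON keeps the nesting (a list of illustrators) and the number type.
as_json = json.dumps(record, ensure_ascii=False, indent=2)

# CSV is flat: a repeated field has to be squeezed into a single cell,
# and every value comes back as text when the file is read again.
flat = {
    key: "; ".join(value) if isinstance(value, list) else value
    for key, value in record.items()
}
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Whoever receives the CSV has to know (or guess) that `"; "` separates names and that the year was once a number. That is again context that travels outside the data itself.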

- CERL thesaurus = https://thesaurus.cerl.org (place names and personal names in Europe in the period of hand press printing, c. 1450 - c. 1830) (paywall) (linked data in RDF + SRU protocol)
- RKD – Netherlands Institute for Art History = https://rkd.nl/en/
- Biografisch Portaal van Nederland = http://www.biografischportaal.nl/
- Deutsche Biographie = https://www.deutsche-biographie.de/

- STCV = www.stcv.be (exports HTML and TAG, open data – ask Goran!)
- STCN = http://picarta.nl/DB=3.11/LNG=EN
- VD16/VD17 = http://www.gateway-bayern.de/index_vd16.html (up to 999 records as CSV and HTML)
- EDIT16 = http://edit16.iccu.sbn.it/
- ESTC = http://estc.bl.uk/
- HPB = http://hpb.cerl.org/ (paywall) (SRU)
- USTC = http://www.ustc.ac.uk/
- ISTC = http://www.bl.uk/catalogues/istc/
- Worldcat.org = www.worldcat.org (exports, plugins, API)

- Material Evidence in Incunabula = http://data.cerl.org/mei/_search (paywall), see also http://textinc.bodleian.ox.ac.uk/
- Digitalisierung der Durchreibungen von Bucheinbanden des 15. und 16. Jahrhunderts = http://www.hist-einband.de/
- Dutch Bookbindings from the KB = http://www.geheugenvannederland.nl/en/geheugen/pages/collectie/Boekbanden+van+de+Koninklijke+Bibliotheek or https://commons.wikimedia.org/wiki/Category:Bookbindings_from_Koninklijke_Bibliotheek

- The Thomas L. Gravell Watermark Collection = http://www.gravell.org/
- Wasserzeichen-Informationssystem = http://www.wasserzeichen-online.de/wzis/index.php
- Wasserzeichenkartei Piccard = http://www.piccard-online.de/
- Watermarks in Incunabula printed in the Low Countries = http://watermark.kb.nl/default/search/advanced/

- TW - Typenrepertorium der Wiegendrucke = http://tw.staatsbibliothek-berlin.de/ (XML export)
- Early Modern Typography (flickr) = https://earlymoderntypography.com/
- Typografische Ornamenten Repertorium van Antwerpse Drukkers, 1541-1600 (TORAD) = http://zoeken.felixarchief.be/zHome/Home.aspx?id_isad=412841
- Printers’ Devices database = http://www.bib.ub.edu/fileadmin/impressors/home_eng.htm
- Image-based similarity search from BSB = https://www.digitale-sammlungen.de/index.html?c=bildsuche&l=en (browse through similarity + upload of images!)




## DH example

Disclaimer!

> Tom Deneire, "A Datamining Approach to the Short Title Catalogue Flanders: the Case of Early Modern Quiring Practices" (arXiv, submitted on 28 Jun 2017 (v1), last revised 19 Nov 2018 (v2)).
>
> This paper contains a data mining approach to the Short Title Catalogue Flanders, which aims to record all books printed in Flanders up to 1801 (24.850 editions, per 31/08/2018). More specifically, it aims to analyse the Early Modern practice of 'quiring' gatherings in handpress book production.

[link](https://arxiv.org/abs/1706.09406v2)

[pdf](https://arxiv.org/pdf/1706.09406v2.pdf)

## Overview



## Assignment: MARC21 to Dublin Core

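Records can be requested over OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). The following request asks for a record in MARC21:

https://anet.be/oai/catgeneric/server.phtml?verb=GetRecord&metadataPrefix=marc21&identifier=c:lvd:123456

Solution: request the same record with `metadataPrefix=oai_dc`:

https://anet.be/oai/catgeneric/server.phtml?verb=GetRecord&metadataPrefix=oai_dc&identifier=c:lvd:123456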
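The two requests above differ only in their `metadataPrefix` parameter, which a short Python sketch makes explicit. The helper names `get_record_url` and `dublin_core_fields` are my own, and `SAMPLE_RESPONSE` is a hypothetical, heavily trimmed response written for this example. It is not actual output of the Anet server, which I have not reproduced here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://anet.be/oai/catgeneric/server.phtml"
DC_NS = "http://purl.org/dc/elements/1.1/"


def get_record_url(identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    }
    # safe=":" keeps the colons of the identifier readable in the URL.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


def dublin_core_fields(oai_response: str) -> list[tuple[str, str]]:
    """Collect (element, text) pairs for all Dublin Core elements in a response."""
    root = ET.fromstring(oai_response)
    prefix = "{" + DC_NS + "}"
    return [
        (element.tag[len(prefix):], (element.text or "").strip())
        for element in root.iter()
        if element.tag.startswith(prefix)
    ]


marc_url = get_record_url("c:lvd:123456", "marc21")
dc_url = get_record_url("c:lvd:123456", "oai_dc")
print(marc_url)
print(dc_url)

# Hypothetical, trimmed oai_dc response used instead of a live request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Hypothetical title</dc:title>
      <dc:date>1596</dc:date>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""
print(dublin_core_fields(SAMPLE_RESPONSE))
```

To try this against the live server you could fetch `dc_url` with `urllib.request.urlopen` and pass the response body to `dublin_core_fields`. Comparing that result with the MARC21 version of the same record shows which fields the conversion keeps and which it drops.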