# Chapter 05

## Metadata

In the previous chapter I consciously steered the discussion away from information models to data models. Whereas Information Retrieval specifically deals with retrieving information (documents, multimedia, ...), Information Science focusses more on data or, better said, **metadata**.

[Wikipedia](https://en.wikipedia.org/wiki/Metadata) says:

> Metadata is "data that provides information about other data". In other words, it is "data about data." Many distinct types of metadata exist, including descriptive metadata, structural metadata, (...)

Descriptive metadata, for instance, is descriptive information about a resource. It is used for discovery and identification. If we think about books, for example, descriptive metadata would include elements such as title, author, date of publication, etcetera. Another example is Exif (Exchangeable image file format), a metadata standard that specifies the formats for images, sound, and ancillary tags used by digital cameras (including smartphones), scanners and other systems handling image and sound files recorded by digital cameras.

## It's complicated

However, this definition of metadata as "data about data" somewhat misses the mark. In order to illustrate that, we need to take a look at a concrete example of metadata.

Have a look at this title page:

![Justus Lipsius's Poliorcetica](Lipsius.jpg)


The STCV catalogue, which we have already mentioned a few times, catalogues this book as follows ([permalink](https://anet.be/record/stcvopac/c:stcv:7081813/E)):

Category | Metadata
--- | ---
Title |	Title page: Poliorceticωn sive De machinis. Tormentis. Telis. Libri qvinqve
Author | Title page: Ivstus Lipsius \[Lipsius, Justus]
. | External: van Veen, Otto \[Illustrator]
. | External: van der Borcht, Pieter I \[Illustrator]
Publication	| Title page: ex officina Plantiniana, apud viduam, & Ioannem Moretum \[Rivière, Jeanne & Jan I Moretus]
. | Title page: Antverpiæ \[Antwerp]
. | 1596
Language | Latin \[Target language]

Now let's compare this to the title page.

- For instance, we see that the difference between lower case, upper case and small caps has disappeared. This might seem trivial, but did you realise that in Renaissance Latin a capitalized final "I" stands for "ii"? So the full title actually begins "Ivsti Lipsii...". Or suppose that someone is studying the use of capitalization in title layout as a marketing device: for them, this is vital information that is missing.

- Moreover, when we focus on the title, we see small changes. Whereas the title page does not have a space between "Tormentis.Telis", the transcription does. And it also leaves out the italic parts (another piece of layout info missing!) "Ad Historiarum lucem" (In light of History) and "Cum Privilegiis Caesareo & Regio" (With Imperial and Royal Privilege). Again, such information could be very interesting to book historians.

- Also, when we look at the date, we notice this has been interpreted rather than transcribed. The Roman numeral (with the typical dots) has been silently turned into '1596'.

- Furthermore, we see that copy-specific information, such as the stamp in the upper right corner and the pasted-on inscription at the bottom, has also been left out.

- On the other hand, there is also more information in the descriptive metadata than is on the title page. The names of the illustrators, for instance, and the name of Christopher Plantin's widow are added.
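The distinctions drawn in this list (what comes from the title page, what was added from elsewhere, what the cataloguer supplied in brackets) can be made explicit in a data structure. The following is only a minimal sketch: the class `Statement` and its field names are hypothetical and are not part of STCV or any cataloguing standard; the values are copied from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """One catalogue statement with its provenance (hypothetical model)."""
    category: str                    # e.g. "Title", "Author", "Publication"
    source: Optional[str]            # "Title page", "External", or None if unlabelled
    text: str                        # the text as the catalogue gives it
    bracketed: Optional[str] = None  # the cataloguer's addition in square brackets


# Values taken from the STCV record shown above.
record = [
    Statement("Title", "Title page",
              "Poliorceticωn sive De machinis. Tormentis. Telis. Libri qvinqve"),
    Statement("Author", "Title page", "Ivstus Lipsius", "Lipsius, Justus"),
    Statement("Author", "External", "van Veen, Otto", "Illustrator"),
    Statement("Author", "External", "van der Borcht, Pieter I", "Illustrator"),
    Statement("Publication", "Title page", "Antverpiæ", "Antwerp"),
    Statement("Publication", None, "1596"),  # the record does not label its source
]

# Which statements would you NOT find by looking at the title page alone?
external = [s.text for s in record if s.source == "External"]
print(external)
```

Keeping the source label next to each value means that a later query can still tell a transcription from an addition, which is exactly the information that gets lost when the fields are flattened into plain strings.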

Let's compare this with the entry in [WorldCat](http://www.worldcat.org/oclc/79260741):

Ivsti LipsI Poliorcetic\[o]n, sive, De machinis, tormentis, telis, libri qvinqve : ad historiarum lucem.

Category | Metadata
--- | ---
Author | Justus Lipsius; Petrus van der Borcht
Publisher |	Antverpiæ : Ex Officina Plantiniana, apud Viduam, & Ioannem Moretum, M.D. XCVI \[1596]

The information is pretty similar, but now that we are attuned to some of the subtleties we notice the differences. "Ad historiarum lucem" is present, for instance. On the other hand, WorldCat has different capitalization and provides only one illustrator, without specifying that this is external information (i.e. not included on the title page).
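The date is a good place to see the difference between transcribing and interpreting. This record keeps both: the Roman numeral "M.D. XCVI" and the bracketed "1596". As a rough sketch, the interpretation step could look like this in Python. The function name `roman_to_int` is my own, and the code assumes the input contains nothing but a standard Roman numeral with optional dots and spaces; other forms found in early modern imprints (such as the apostrophus notation) are not handled.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(imprint_date: str) -> int:
    """Interpret a Roman-numeral date, ignoring dots and spaces."""
    # Assumption: everything that is not a Roman digit is mere punctuation.
    numeral = re.sub(r"[^IVXLCDM]", "", imprint_date.upper())
    if not numeral:
        raise ValueError(f"no Roman numeral found in {imprint_date!r}")
    total = 0
    for position, digit in enumerate(numeral):
        value = ROMAN_VALUES[digit]
        following = numeral[position + 1:position + 2]  # "" at the end
        # A smaller value before a larger one is subtracted (XC = 90).
        if following and value < ROMAN_VALUES[following]:
            total -= value
        else:
            total += value
    return total


transcribed = "M.D. XCVI"
date = {"transcribed": transcribed, "interpreted": roman_to_int(transcribed)}
print(date)  # {'transcribed': 'M.D. XCVI', 'interpreted': 1596}
```

Storing the transcription alongside the interpretation costs almost nothing, and it lets a later user see that a decision was made.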

This example shows that different catalogues adhere to different cataloguing rules. It is impossible to simply catalogue a book title (or indeed any item, be it physical or digital) "as is". However diplomatic and inclusive you try to be when cataloguing, you will always have to make hard decisions about how to handle layout, how to transcribe characters, whether or not to standardize spelling, punctuation, etcetera.

All of this is perfectly understandable, and in general (good) catalogues will be explicit and very scrupulous about the cataloguing rules they follow. The danger is that when we leave the cataloguing context and, for instance, acquire catalogue information in a data dump (STCV is freely available [here](https://www.uantwerpen.be/nl/projecten/anet/open-data/)), we tend to forget this and take the metadata at face value.

Imagine, for a minute, that you hadn't seen the above title page, but merely got the STCV metadata from a SQL query. How accurate would your understanding of this title page actually be? And what happens when, as good DH research is bound to do, you break open metadata containers and aggregate metadata, for instance merging several of the national "short-title catalogue" initiatives (STCV, STCN, ESTC, USTC, ...), which all adhere to different rules?
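To get a feel for what aggregation runs into, here is a small Python sketch that compares the two title strings quoted above, first as they are and then after a crude normalization. The `normalize` and `similarity` helpers are my own illustrations, not the rules of any catalogue; `difflib.SequenceMatcher` from the standard library is used as a simple stand-in for a real record-matching method.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Title strings as given in the STCV and WorldCat records above.
stcv_title = "Poliorceticωn sive De machinis. Tormentis. Telis. Libri qvinqve"
worldcat_title = ("Ivsti LipsI Poliorcetic[o]n, sive, De machinis, tormentis, "
                  "telis, libri qvinqve : ad historiarum lucem.")


def normalize(title: str) -> str:
    """Crude normalization: lower-case, drop punctuation, collapse spaces."""
    title = unicodedata.normalize("NFKC", title).casefold()
    title = re.sub(r"[^\w\s]", "", title)      # removes . , : [ ]
    return re.sub(r"\s+", " ", title).strip()


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


raw_score = similarity(stcv_title, worldcat_title)
normalized_score = similarity(normalize(stcv_title), normalize(worldcat_title))

print(stcv_title == worldcat_title)            # exact matching fails
print(f"raw: {raw_score:.2f}, normalized: {normalized_score:.2f}")
```

Normalization brings the two strings closer together, but they still do not become identical: one record includes the author's name and "ad historiarum lucem" in the title, the other does not. Note also that the normalization throws away capitalization and punctuation, the very features discussed above, so the original strings should always be kept next to the normalized ones.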

To make matters worse, our example was a very simple one really. There are many, many more complex metadata problems. Just to give you a taste:

- How would you catalogue one of those toddler squeaky books that feature not a single word of text?
- When in 1993 Prince changed his stage name to the unpronounceable symbol ![Prince](prince.png) (known to fans as the "Love Symbol"), and was sometimes referred to as the Artist Formerly Known as Prince or simply the Artist, how were record shops supposed to catalogue his albums? Remember, in those days, most people would go up to the "P" section and browse for "Prince"!
- Or what about the IMDb website listing the actors of the Blair Witch Project as "missing, presumed dead" in the first year of the film's availability (see [this](https://www.telegraph.co.uk/films/2016/07/25/why-did-the-world-think-the-blair-witch-project-really-happened/) article)?

