Background Information

CTS Text Inventories identify the textgroups, works, versions and citation schemes of the documents contained in a CTS repository, and supply accompanying metadata about them.

Currently at Perseus, these inventories are managed separately from the source TEI files and also from the CITE Collections, which are the sources for the textgroup, work, and version metadata.

The Perseus Catalog makes CTS Text Inventories available in Atom feeds at the textgroup, work and version levels. The CTS Inventories in these feeds are generated automatically from the data in the CITE collections and do not (or very soon will not) contain the cts:online element with the citation mapping information for the actual TEI versions.

We are currently in a state of transition with regard to how the citation information for the TEI files is obtained. The refsDecl elements in the P4 versions of the TEI XML often contain multiple, sometimes competing, citation schemes, and are not CTS-compatible, both because they use milestones rather than nodes to identify the citation scheme elements and because CTS requires a single canonical scheme for a text. The CTS Text Inventory generated by or for the Perseus 4 Hopper CTS API just reports whatever is defined as the default scheme for the text, usually the first one, and does not supply XPath information.

We have been focusing on making all of the Perseus TEI XML CTS-compliant, i.e. with one citation scheme taking precedence in the node hierarchy of the XML. The texts in the PerseusDL/canonical repo are currently in various states in that regard, but the refsDecl elements in their teiHeader elements have not been kept up to date with the changes. To describe the citation schemes actually in use, we have relied only on the CTS Text Inventories for deployed CTS repos, namely the Perseids pilots.xml and annotsrc.xml.

Currently, when you add a text to your workspace for editing in Perseids, a copy of the TextInventory as it stood when the text was checked out is added to the publication and carried along with it. This enables the CTS extension API in Perseids to do passage-based editing and annotating of the text, and to identify the translations available for a text. However, the inventories are not currently updated automatically when new translations or transcriptions are added (whether in a workspace or when a publication is approved and merged back into the master repo).

The Alpheios CTS 3 XQuery CTS API code operates on full text inventories as the authority for the CTS metadata for the texts contained in the repository.

The CTS 5 XQuery CTS API code generates a TextInventory for the repo on the fly, building it up from textgroup, work, and version level inventory components kept with the source data files.

Some User Stories

  1. As an admin, I want to be able to automatically build and deploy a CTS repository containing all of the Perseus texts, using the Catalog CITE collections to supply the metadata for textgroups, versions and works, and the TEI XML files from the GitHub repo for the citation mapping information and text data.

  2. As an admin, I want to be able to automatically build and deploy a CTS repository containing predefined subsets of the Perseus texts, using the Catalog CITE collections to supply the metadata for textgroups, versions and works, and the TEI XML files from the GitHub repo for the citation mapping information and text data.

  3. As an individual scholar, I want to browse the Perseus Catalog, select the texts I want and export a fully functional standalone CTS Repository of just those texts from the Perseus GitHub repo.

  4. As an individual scholar, I want to be able to provide a configuration file containing the CTS URNs of the texts and generate a fully functional standalone CTS Repository of those texts from the Perseus GitHub repo and metadata derived from the Perseus Catalog.

  5. TODO - we need a few user stories around editing and managing different "versions" of a CTS edition or translation ...
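
Story 4 implies some machine-readable selection format. As a minimal sketch, assuming a hypothetical plain-text configuration file with one CTS URN per line and `#` comments (no such format is defined yet), the selection could be parsed as follows. The `CtsUrn` type, the `parse_cts_urn` and `read_selection` names, and the sample URN in the usage note are illustrative only.

```python
from typing import List, NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN without a passage component into its parts."""
    parts = urn.strip().split(":")
    if len(parts) != 4 or parts[:2] != ["urn", "cts"] or not parts[2]:
        raise ValueError(f"not a CTS URN without passage: {urn!r}")
    levels = parts[3].split(".")
    # This sketch handles textgroup, work and version levels only.
    if not 1 <= len(levels) <= 3 or not all(levels):
        raise ValueError(f"unexpected work component: {parts[3]!r}")
    # Pad so that textgroup-only and work-only URNs still unpack cleanly.
    textgroup, work, version = levels + [None] * (3 - len(levels))
    return CtsUrn(parts[2], textgroup, work, version)


def read_selection(config_text: str) -> List[CtsUrn]:
    """Parse the hypothetical selection file: one URN per line, '#' starts a comment."""
    urns = []
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            urns.append(parse_cts_urn(line))
    return urns
```

A file containing the single line `urn:cts:greekLit:tlg0012.tlg001.perseus-grc1` would select one version of one work; a line with only `urn:cts:greekLit:tlg0012` would select the whole textgroup.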

Design Notes

The refsDecl in the teiHeader of each TEI XML file must accurately represent the canonical citation scheme of the document. If we use the cRefPattern syntax in this element to define the XPaths and citation mapping patterns, we should be able to automatically construct the CTS citationMapping components of the CTS TextInventory for any given text.

Example from http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACR:

<refsDecl xml:id="biblical">
 <cRefPattern matchPattern="(.+) (.+):(.+)"
  replacementPattern="#xpath(//div[@n='$1']/div[$2]/div[$3])">
  <p>This pointer pattern extracts and references the <q>book,</q>
   <q>chapter,</q> and <q>verse</q> parts of a biblical reference.</p>
 </cRefPattern>
 <cRefPattern matchPattern="(.+) (.+)"
  replacementPattern="#xpath(//div[@n='$1']/div[$2])">
  <p>This pointer pattern extracts and references the <q>book</q> and
  <q>chapter</q> parts of a biblical reference.</p>
 </cRefPattern>
 <cRefPattern matchPattern="(.+)"
  replacementPattern="#xpath(//div[@n='$1'])">
  <p>This pointer pattern extracts and references just the <q>book</q>
     part of a biblical reference.</p>
 </cRefPattern>
</refsDecl>
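
To make the mechanics concrete, here is a minimal Python sketch of how a processor might resolve a reference against these patterns. It assumes that the patterns are tried in document order, that the first one matching the whole reference wins, and that Python's `re` syntax is close enough to the regular expression dialect TEI specifies for these simple cases. The `resolve` function and the `BIBLICAL` list are names made up for this illustration.

```python
import re

# (matchPattern, replacementPattern) pairs copied from the TEI example,
# most specific pattern first.
BIBLICAL = [
    (r"(.+) (.+):(.+)", "#xpath(//div[@n='$1']/div[$2]/div[$3])"),
    (r"(.+) (.+)", "#xpath(//div[@n='$1']/div[$2])"),
    (r"(.+)", "#xpath(//div[@n='$1'])"),
]


def resolve(reference, patterns):
    """Return the XPath for a reference, using the first pattern that matches it fully."""
    for match_pattern, replacement in patterns:
        match = re.fullmatch(match_pattern, reference)
        if match is None:
            continue
        # Substitute $1, $2, ... with the corresponding captured groups.
        pointer = re.sub(
            r"\$(\d+)", lambda ph: match.group(int(ph.group(1))), replacement
        )
        # Strip the #xpath(...) wrapper when present.
        if pointer.startswith("#xpath(") and pointer.endswith(")"):
            return pointer[len("#xpath("):-1]
        return pointer
    raise ValueError(f"no cRefPattern matches {reference!r}")
```

With these patterns, `resolve("Matthew 5:7", BIBLICAL)` yields `//div[@n='Matthew']/div[5]/div[7]`, while `resolve("Matthew", BIBLICAL)` falls through to the book-only pattern.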

For example, for Homer's Iliad:

Citation Mapping:

<citationMapping>
  <citation label="Book" xpath="/div1[@n='?']" scope="/TEI.2/text/body">
    <citation label="Line" xpath="//l[@n='?']"
              scope="/TEI.2/text/body/div1[@n='?']"/>
  </citation>
</citationMapping>

And as a cRefPattern-based refsDecl:

<refsDecl xml:id="CTS">
 <cRefPattern matchPattern="(.+)\.(.+)"
  replacementPattern="#xpath(/TEI.2/text/body/div1[@n='$1']//l[@n='$2'])">
  <p>This pointer pattern extracts the book and line parts of a reference.</p>
 </cRefPattern>
 <cRefPattern matchPattern="(.+)"
  replacementPattern="#xpath(/TEI.2/text/body/div1[@n='$1'])">
  <p>This pointer pattern extracts just the book part of a reference.</p>
 </cRefPattern>
</refsDecl>
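
Both forms of the Iliad example encode the same two-level scheme, so the citationMapping can in principle be derived mechanically from the refsDecl. The following is a minimal sketch under several assumptions: every cRefPattern uses an `#xpath(...)` replacement, each deeper pattern extends the next shallower one, the refsDecl uses `div1` steps consistent with the citation mapping, and the citation labels (which the refsDecl does not carry) are supplied by the caller. The function names are hypothetical, and the sample input uses empty cRefPattern elements to stay short.

```python
import re
import xml.etree.ElementTree as ET

REFS_DECL = r"""<refsDecl xml:id="CTS">
 <cRefPattern matchPattern="(.+)\.(.+)"
  replacementPattern="#xpath(/TEI.2/text/body/div1[@n='$1']//l[@n='$2'])"/>
 <cRefPattern matchPattern="(.+)"
  replacementPattern="#xpath(/TEI.2/text/body/div1[@n='$1'])"/>
</refsDecl>"""


def citation_paths(refs_decl_xml):
    """Return the XPath of each cRefPattern, shallowest citation level first."""
    paths = []
    for el in ET.fromstring(refs_decl_xml).iter():
        # Compare local names so a namespaced refsDecl is also accepted.
        if el.tag.rsplit("}", 1)[-1] != "cRefPattern":
            continue
        m = re.fullmatch(r"#xpath\((.*)\)", el.get("replacementPattern", ""))
        if m:
            paths.append(m.group(1))
    return sorted(paths, key=lambda p: len(set(re.findall(r"\$\d+", p))))


def to_placeholder(xpath):
    """Replace $1, $2, ... with the '?' placeholder used in citationMapping."""
    return re.sub(r"\$\d+", "?", xpath)


def build_citation_mapping(refs_decl_xml, labels):
    """Build a nested citationMapping element; labels run from outermost to innermost."""
    paths = citation_paths(refs_decl_xml)
    if len(paths) != len(labels):
        raise ValueError("need exactly one label per citation level")
    mapping = ET.Element("citationMapping")
    parent, previous = mapping, None
    for label, path in zip(labels, paths):
        if previous is None:
            # Top level: the scope is everything before the step that holds $1.
            m = re.fullmatch(r"(.*?)(/+[^/]*\$1.*)", path)
            if m is None:
                raise ValueError(f"cannot split top-level path: {path}")
            scope, step = m.group(1), m.group(2)
        else:
            if not path.startswith(previous):
                raise ValueError(f"{path} does not extend {previous}")
            scope, step = previous, path[len(previous):]
        parent = ET.SubElement(parent, "citation", {
            "label": label,
            "xpath": to_placeholder(step),
            "scope": to_placeholder(scope),
        })
        previous = path
    return mapping
```

Calling `build_citation_mapping(REFS_DECL, ["Book", "Line"])` produces a Book citation scoped to `/TEI.2/text/body` with a nested Line citation, and `ET.tostring(mapping, encoding="unicode")` serializes it.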

Ideally, I think a CTS API implementation should be able to work both with a traditional full CTS inventory and with directory structure conventions, metadata fragments, and the TEI refsDecl header. There are several reasons for supporting the naming convention/metadata fragment approach:

  1. As the number of texts increases, a single large XML file containing the entire inventory becomes unmanageable.
  2. Texts can be added and removed easily without requiring updates to a master file.
  3. It avoids redundancy and duplication of information, particularly with regard to the citation mapping information, which should be contained in the TEI XML file anyway.
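
As an illustration of points 1 and 2, the sketch below discovers the contents of a repository from file names alone. The layout `<root>/<textgroup>/<work>/<textgroup>.<work>.<version>.xml` and the default `greekLit` namespace are assumptions made for this example, not conventions this document has settled on, and `discover_texts` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List


def discover_texts(root: str, namespace: str = "greekLit") -> Dict[str, Dict[str, List[str]]]:
    """Map textgroup URN -> work URN -> sorted version URNs, using file names only."""
    prefix = f"urn:cts:{namespace}:"
    inventory: Dict[str, Dict[str, List[str]]] = {}
    for group_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        works = inventory.setdefault(prefix + group_dir.name, {})
        for work_dir in sorted(p for p in group_dir.iterdir() if p.is_dir()):
            work_id = f"{group_dir.name}.{work_dir.name}"
            # Only files named <textgroup>.<work>.<version>.xml count as versions;
            # other files in the directory (e.g. metadata fragments) are ignored here.
            versions = sorted(
                prefix + f.stem for f in work_dir.glob(f"{work_id}.*.xml")
            )
            works[prefix + work_id] = versions
    return inventory
```

Under this assumed layout, adding or removing a text is just adding or removing a file. The citation mapping for each version would still come from its refsDecl, and the descriptive metadata from the fragments.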