CLARIAH/wp6-missieven

General Missives

SWH DOI Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Corpus

This repo contains a structurally clean version of the data of the General Missives, volumes 1-14.

Read more in about.

Rationale for this representation of the corpus

Cleaning a textual dataset is a lot of work. If such a dataset is a standard work, it will be studied by many students/researchers from several disciplines. To make life easier for those people, they should be able to start with a dataset that is readily processable by any tool of their choice.

Text-Fabric provides a data model that captures the data at the end of the cleaning process, just before it goes into other tools. It also supports the integration of subsequent annotations with the original data.

The Missives corpus is an example of how that works.

Getting started

Search interface to-go

For a first impression, start with missieven-search. This is a static website that sends the whole corpus to your browser. After a few seconds you can start searching.

You can do full text search via regular expressions, not only in the text, but also in some of its attributes. For example, you can search for a word in original letter texts or in editorial remarks.
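To make the idea of searching both text and attributes concrete, here is a minimal sketch in plain Python. It is not the implementation of missieven-search. The records, the `kind` attribute, and the `search` function are hypothetical and only illustrate how one regular expression can filter on the text while another filters on an attribute.

```python
import re

# Hypothetical records: each word carries its text plus a "kind" attribute
# saying whether it comes from an original letter or an editorial remark.
WORDS = [
    {"text": "Batavia", "kind": "orig"},
    {"text": "schepen", "kind": "orig"},
    {"text": "Bataviase", "kind": "remark"},
    {"text": "retourvloot", "kind": "remark"},
]


def search(words, text_pattern, kind_pattern=None):
    """Return the texts of words matching text_pattern.

    If kind_pattern is given, the word's "kind" attribute must
    match it completely as well.
    """
    text_re = re.compile(text_pattern)
    kind_re = re.compile(kind_pattern) if kind_pattern else None
    hits = []
    for word in words:
        if not text_re.search(word["text"]):
            continue
        if kind_re is not None and not kind_re.fullmatch(word["kind"]):
            continue
        hits.append(word["text"])
    return hits
```

Calling `search(WORDS, r"^Batavia")` looks in all words, while `search(WORDS, r"^Batavia", "remark")` restricts the same search to editorial remarks.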

More info in the manual.

An example search is in example.json. Download the file, then import it in your search interface, and you will see the search run.

You can save search results to Excel files.

Text-Fabric browser

You get more power when you download Text-Fabric. Text-Fabric operates in the ecosystem of Python and its libraries.

But you do not have to program in order to browse and search the corpus. After installing Python and

pip3 install text-fabric

on the command line, say

tf clariah/wp6-missieven

and a web server is started on your computer, which serves you a search-and-browse interface on the General Missives corpus. You can search more precisely here than in the search interface to-go above.

You can save search results to Excel files.

Jupyter notebooks

Text-Fabric is particularly suited to Jupyter notebooks. There is a handy way to install Python and JupyterLab in one go, and Text-Fabric from there.

The next step is to consult the tutorial. This is a series of notebooks that guides you through the computing facilities of Text-Fabric. Text-Fabric is just a library that you import in your own Python programs, which means that you can invoke the whole of Python and its libraries to do your job. The only thing Text-Fabric does is to offer you a handy computing interface to the textual data and their annotations.
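As a rough illustration of what importing Text-Fabric as a library looks like, here is a minimal sketch. It assumes that text-fabric is installed and that `use` from `tf.app` accepts the location `clariah/wp6-missieven` mentioned in this README. The helper names `load_corpus` and `node_type_counts` are made up for this example; the tutorial remains the authoritative guide.

```python
def load_corpus(location="clariah/wp6-missieven"):
    """Load the corpus through Text-Fabric.

    The import is done lazily, so this file can be read and imported
    even on a machine where text-fabric is not installed.
    Assumption: Text-Fabric fetches the data when it is not yet present.
    """
    from tf.app import use

    return use(location)


def node_type_counts(F):
    """Return a dict {node type: number of nodes of that type}.

    F is the feature API of a loaded corpus (A.api.F).
    F.otype.all lists the node types, F.otype.s(t) gives the nodes of type t.
    """
    return {otype: len(F.otype.s(otype)) for otype in F.otype.all}
```

In a notebook you would typically run `A = load_corpus()`, then `node_type_counts(A.api.F)` to see which node types the corpus contains before writing queries against them.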

See other corpora for experiences with Text-Fabric as a pre-processing tool in other corpora.

Getting the corpus data

The data of the corpus is in the wp6-missieven repo on GitHub:

  • as simple, TEI-like XML (see the xml directory in this repo)
  • as plain Text-Fabric files (see the tf directory in this repo)

If you use any of the methods of working with the corpus indicated above, you do not have to do anything special to download the data. If you tell Text-Fabric that the data is in clariah/wp6-missieven, it finds the data and downloads it automatically when needed.

Authors

This repo is by

Acknowledgements

  • Jesse de Does provided TEI-XML files for volumes 1-13.
  • Lodewijk Petram provided textual PDFs for volume 14, bands (i) and (ii).
  • Sophie Arnoult used the Text-Fabric data to perform Named Entity Recognition and delivered the results back.

Status

  • 2022-10-13 updates in Sophie's named entities, improved tutorials, source texts plus entities delivered as valid TEI-XML (see the convert notebook)
  • 2022-09-08 a tutorial notebook is added to show how to use the named entities detected by the algorithm of Sophie.
  • 2022-05-04 version 1.0: Additional volumes: Volume 14, bands (i) and (ii) have been added. The earlier corrections by Sophie have not been re-applied, but the conversion has been improved so that they are not needed anymore.
  • 2022-04-11 Additional volumes: Volume 14, bands (i) and (ii) are in the process of being converted from textual PDF to Text-Fabric. Most structure has been recognized, but no TF has been generated yet.

older ...

Long term preservation and reproducibility

This repo has been archived in two independent places:

Click the respective badges above to be taken to the archives. There you find ways to cite this work.

You can rerun the conversion programs on the source data and regenerate the simple XML and Text-Fabric versions of the data. See the reproduce guide.

More interfaces

Another version of the data (less cleaned) is visible online in a BlackLab interface.

A latent wish is to make the data of this repository available in a BlackLab interface. In this repo we show how to set up a local BlackLab server and front-end and how to get the present data into BlackLab.

This is work in progress. At this point, follow the BlackLab install guide for macOS.

Thanks to Jesse de Does (key user of BlackLab, INT) and Jan Niestadt (main author of BlackLab, INT) for helping out with setting up and using BlackLab.