# *CitedLoci* pipeline: a step-by-step quick start guide

This notebook will show you step-by-step how to use the *CitedLoci* pipeline to index canonical citations (e.g. Hom. *Il.* 1,1-10) from plain text documents. 

The diagram below indicates the pipeline components that are involved at each step of the process. 

![](imgs/citedloci-pipeline.png)

# Running the pipeline

## Introduction

Upon installation, the Python library `CitationExtractor` also installs a command-line script called `citedloci-pipeline`, which allows you to execute the various steps of the citation extraction pipeline directly from the command line.

In [38]:
!citedloci-pipeline --version

No handlers could be found for logger "citation_extractor.crfpp_wrap"
1.7.2


To simplify the pipeline execution, all configuration parameters are stored in the file [`config/project.ini`](config/project.ini), which looks as follows:

In [39]:
!cat config/project.ini

[general]

storage = pickles
working_dir = ./data/


[preproc]

abbreviation_list = data/abbreviations.txt
split_sentences = false
treetagger_home = /home/romanell/tree-tagger/

[ner]

# this is horrible. all configurations should be here, not in separate places
model_settings_dir = config/ner/
model_name = crfsuite

[relex]

[ned]

kb_config = config/hucit.ini
#kb_config = config/hucit_local.ini


To speed up processing, some Python objects that have longer initialisation time (e.g. because they require some training) are already pre-computed and stored as **pickled objects** in the folder [`pickles/`](pickles/).

Input documents can be found in [`data/orig/`](data/orig/) and the final JSON output in [`data/json/`](data/json/). Both folders sit inside the working directory (`./data/`) set in the configuration file.
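If you need the same settings in your own scripts, the configuration file can be read with Python's standard library. The snippet below is a minimal sketch, assuming Python 3 and an abridged copy of the file inlined as a string; it is not necessarily how `citedloci-pipeline` itself loads its configuration.

```python
import configparser

# Abridged copy of config/project.ini, inlined so the example is self-contained
CONFIG_TEXT = """
[general]
storage = pickles
working_dir = ./data/

[preproc]
abbreviation_list = data/abbreviations.txt
split_sentences = false

[ner]
model_settings_dir = config/ner/
model_name = crfsuite

[relex]

[ned]
kb_config = config/hucit.ini
#kb_config = config/hucit_local.ini
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Values are strings unless a typed getter such as getboolean() is used
working_dir = parser.get("general", "working_dir")
split_sentences = parser.getboolean("preproc", "split_sentences")
kb_config = parser.get("ned", "kb_config")

print(working_dir, split_sentences, kb_config)
```

The commented-out `kb_config` line is ignored by the parser, so switching to a local knowledge base only requires moving the `#` from one line to the other.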

## Pre-processing

Pre-processing is applied to all input text files, and consists of the following operations:
- sentence splitting
- tokenization and part-of-speech tagging (using `TreeTagger`)
- language identification (using `langid`)

Each pre-processed document is then written to [`data/iob/`](data/iob/) as an IOB-formatted file.

In [40]:
%%time
!citedloci-pipeline do preproc --config=config/project.ini

No handlers could be found for logger "citation_extractor.crfpp_wrap"
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] Logger initialised
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] Current working directory: /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/data
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] There are 1 docs to process
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] Following documents will be processed: [u'bmcr_2013-01-10.txt']
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] Env variable $TREETAGGER_HOME == /home/romanell/tree-tagger/
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] Treetagger for fr successfully initialised
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] Treetagger for en successfully initialised
Wed, 23 Jun 2021 11:55:38 - citation_extractor.Utils.IO - [INFO] Treetagger for nl successfully initialised
Wed, 23 Jun 2021 

## Named entity recognition (NER)

The NER step is responsible for extracting citation components that can be found in a text. 
Each component is tagged with a named entity tag:
- a mention of "Homer" will be tagged as `<AAUTHOR>Homer</AAUTHOR>` (where `AAUTHOR` means ancient author)
- "*Iliad*" will be tagged as `<AWORK>Iliad</AWORK>` (where `AWORK` means ancient work)
- "Hom. Il. 1.1-10" will be tagged as `<REFAUWORK>Hom. Il. </REFAUWORK>` and `<REFSCOPE>1.1-10</REFSCOPE>`
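To make the tagging scheme more concrete, here is a minimal sketch of how token-level IOB labels can be grouped into the entity spans listed above. It assumes the conventional `B-`/`I-`/`O` label prefixes; the function `iob_to_entities` is hypothetical and is not part of `CitationExtractor`.

```python
def iob_to_entities(tokens, labels):
    """Group IOB-labelled tokens into (entity_type, surface) pairs."""
    entities = []
    current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        # An I- label continues the open entity only if the type matches
        continues = prefix == "I" and entity_type == current_type
        if not continues and current_tokens:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = entity_type
            current_tokens.append(token)

    # Flush the entity that is still open at the end of the sentence
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical labelled sentence, for illustration only
tokens = ["see", "Hom.", "Il.", "1.1-10", "and", "Homer"]
labels = ["O", "B-REFAUWORK", "I-REFAUWORK", "B-REFSCOPE", "O", "B-AAUTHOR"]

for entity_type, surface in iob_to_entities(tokens, labels):
    print(entity_type, surface)
```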

Process all files:

In [41]:
!citedloci-pipeline do ner --config=config/project.ini

No handlers could be found for logger "citation_extractor.crfpp_wrap"
Wed, 23 Jun 2021 11:56:00 - citation_extractor.Utils.IO - [INFO] Logger initialised
Wed, 23 Jun 2021 11:56:00 - citation_extractor.Utils.IO - [INFO] Current working directory: /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/data
Wed, 23 Jun 2021 11:56:00 - citation_extractor.Utils.IO - [INFO] There are 1 docs to process
Wed, 23 Jun 2021 11:56:00 - citation_extractor.Utils.IO - [INFO] Following documents will be processed: [u'bmcr_2013-01-10.txt']
Wed, 23 Jun 2021 11:56:00 - citation_extractor.Utils.IO - [INFO] Extractor loaded from pickle /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/pickles/extractor.pkl
Wed, 23 Jun 2021 11:56:02 - citation_extractor.Utils.IO - [INFO] Output successfully written to file
Wed, 23 Jun 2021 11:56:02 - citation_extractor.io.converters - [INFO] Document bmcr_2013-01-10 has 75 sentences
Wed, 23 Jun 2021 11:56:02 - citation_extractor.Utils.IO - [INFO] Finished proc

Instead of processing all files in batch mode, it is also possible to process a single document from the input folder:

In [42]:
# this command is equivalent to the one above, as there is only
# one input text document in any case

!citedloci-pipeline do ner --config=config/project.ini --doc=bmcr_2013-01-10.txt

No handlers could be found for logger "citation_extractor.crfpp_wrap"
Wed, 23 Jun 2021 11:56:05 - citation_extractor.Utils.IO - [INFO] Logger initialised
Wed, 23 Jun 2021 11:56:05 - citation_extractor.Utils.IO - [INFO] Current working directory: /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/data
Wed, 23 Jun 2021 11:56:05 - citation_extractor.Utils.IO - [INFO] There are 1 docs to process
Wed, 23 Jun 2021 11:56:05 - citation_extractor.Utils.IO - [INFO] Following documents will be processed: ['bmcr_2013-01-10.txt']
Wed, 23 Jun 2021 11:56:05 - citation_extractor.Utils.IO - [INFO] Extractor loaded from pickle /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/pickles/extractor.pkl
Wed, 23 Jun 2021 11:56:07 - citation_extractor.Utils.IO - [INFO] Output successfully written to file
Wed, 23 Jun 2021 11:56:07 - citation_extractor.io.converters - [INFO] Document bmcr_2013-01-10 has 75 sentences
Wed, 23 Jun 2021 11:56:07 - citation_extractor.Utils.IO - [INFO] Finished proce

At this point, the JSON output file will contain, among other things, a list of the extracted named entities (i.e. citation components). 
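Since the output is plain JSON, the entities can also be loaded in Python. The sketch below uses two entities copied from this notebook's output, inlined as a string so that it runs on its own; with the real file you would load `data/json/bmcr_2013-01-10.json` instead.

```python
import json
from collections import Counter

# Two entities copied from this notebook's output; the real file has more
SAMPLE = """
{"entities": {
  "11": {"surface": "Thucydide", "start_offset": 8553, "end_offset": 8562,
         "id": "11", "entity_type": "AAUTHOR"},
  "10": {"surface": "Thucydides,", "start_offset": 8416, "end_offset": 8427,
         "id": "10", "entity_type": "AAUTHOR"}
}}
"""

doc = json.loads(SAMPLE)

# Order entities by their position in the text rather than by dictionary key
entities = sorted(doc["entities"].values(), key=lambda e: e["start_offset"])

# Count how many entities of each type were extracted
type_counts = Counter(e["entity_type"] for e in entities)

for entity in entities:
    print(entity["id"], entity["entity_type"], repr(entity["surface"]))
print(dict(type_counts))
```

Sorting by `start_offset` is useful because the keys of the `entities` object do not follow the order in which the entities appear in the document.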

In [43]:
cat data/json/bmcr_2013-01-10.json | jq ".entities"

{
  "11": {
    "surface": "Thucydide",
    "end_offset": 8562,
    "id": "11",
    "start_offset": 8553,
    "entity_type": "AAUTHOR"
  },
  "10": {
    "surface": "Thucydides,",
    "end_offset": 8427,
    "id": "10",
    "start_offset": 8416,
    "entity_type": "AAUTHOR"
  },
  "13": {
    "surface": "(Hdt.",
    "end_offset":

## Relation extraction

The relation extraction step groups together components that are part of the same citation. This step is necessary to reconstruct the existing logical relation between consecutive citations to the same work. 

In [44]:
!citedloci-pipeline do relex --config=config/project.ini

No handlers could be found for logger "citation_extractor.crfpp_wrap"
Wed, 23 Jun 2021 11:56:10 - citation_extractor.Utils.IO - [INFO] Logger initialised
Wed, 23 Jun 2021 11:56:10 - citation_extractor.Utils.IO - [INFO] Current working directory: /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/data
Wed, 23 Jun 2021 11:56:10 - citation_extractor.Utils.IO - [INFO] There are 1 docs to process
Wed, 23 Jun 2021 11:56:10 - citation_extractor.Utils.IO - [INFO] Following documents will be processed: [u'bmcr_2013-01-10.json']
Wed, 23 Jun 2021 11:56:10 - citation_extractor.Utils.IO - [INFO] Document bmcr_2013-01-10.json contains 14 entities.
Wed, 23 Jun 2021 11:56:10 - citation_extractor.Utils.IO - [INFO] Document bmcr_2013-01-10.json (14 entities, 4 relations) written to /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/data/json/bmcr_2013-01-10.json


Each relation receives an ID (e.g. `R4`) and is made of two components (that we call *arguments*). Each argument is the ID of the corresponding entity.
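The following sketch shows how the two arguments of a relation can be resolved back to the entities they point to. The sample reproduces relation `R4` from this document; the fields given for entity `14` are inferred from the pipeline logs and the final table in this notebook, so treat them as an assumption. The helper `resolve_relations` is hypothetical.

```python
# Illustrative sample: relation R4 and its two argument entities
doc = {
    "entities": {
        "13": {"id": "13", "entity_type": "REFAUWORK", "surface": "(Hdt."},
        # Assumed values for entity 14 (not shown in full in the output)
        "14": {"id": "14", "entity_type": "REFSCOPE", "surface": "3.38.5)."},
    },
    "relations": {"R4": ["13", "14"]},
}


def resolve_relations(doc):
    """Map each relation ID to the surfaces of its two arguments."""
    resolved = {}
    for relation_id, (arg1_id, arg2_id) in doc["relations"].items():
        arg1 = doc["entities"][arg1_id]
        arg2 = doc["entities"][arg2_id]
        resolved[relation_id] = (arg1["surface"], arg2["surface"])
    return resolved


for relation_id, surfaces in resolve_relations(doc).items():
    print(relation_id, " ".join(surfaces))
```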

In [45]:
cat data/json/bmcr_2013-01-10.json | jq ".|.relations"

{
  "R4": [
    "13",
    "14"
  ],
  "R1": [
    "3",
    "4"
  ],
  "R2": [
    "5",
    "6"
  ],
  "R3": [
    "7",
    "8"
  ]
}


## Named entity linking

The last step assigns a unique identifier (a CTS URN) to each canonical citation (relation) previously extracted from the text.
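A CTS URN is a colon-separated identifier of the form `urn:cts:<namespace>:<work>[:<passage>]`. As an illustration, the sketch below splits such an identifier into its components, using a URN that appears later in this notebook for Hdt. 3.38.5. The helper `parse_cts_urn` is hypothetical, covers only this basic shape, and is not part of the pipeline.

```python
from collections import namedtuple

CtsUrn = namedtuple("CtsUrn", ["namespace", "work", "passage"])


def parse_cts_urn(urn):
    """Split a CTS URN into namespace, work and (optional) passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError("not a CTS URN: %r" % urn)
    # The passage component is absent when the URN identifies a whole work
    passage = parts[4] if len(parts) == 5 else None
    return CtsUrn(namespace=parts[2], work=parts[3], passage=passage)


passage_urn = parse_cts_urn("urn:cts:greekLit:tlg0016.tlg001:3.38.5")
work_urn = parse_cts_urn("urn:cts:greekLit:tlg0016.tlg001")

print(passage_urn)
print(work_urn)
```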

⚠️ If you are executing this notebook from Binder, this step is likely to take a very long time, due to the limitations of the remote knowledge base that is used by default. For faster processing, it is recommended to set up a local triple store for the knowledge base, and then point to it in the `[ned]` section of the project configuration file ([`config/project.ini`](config/project.ini)).

In [46]:
!citedloci-pipeline do ned --config=config/project.ini

No handlers could be found for logger "citation_extractor.crfpp_wrap"
Wed, 23 Jun 2021 11:56:13 - citation_extractor.Utils.IO - [INFO] Logger initialised
Wed, 23 Jun 2021 11:56:13 - citation_extractor.Utils.IO - [INFO] Current working directory: /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/data
Wed, 23 Jun 2021 11:56:14 - citation_extractor.Utils.IO - [INFO] CitationMatcher loaded from pickle /media/romanell/4T/matteo/ClassicsCitations/IndexLocorum101/pickles/matcher.pkl
Wed, 23 Jun 2021 11:56:14 - citation_extractor.Utils.IO - [INFO] There are 1 docs to process
Wed, 23 Jun 2021 11:56:14 - citation_extractor.Utils.IO - [INFO] Following documents will be processed: [u'bmcr_2013-01-10.json']
(CITATION (REF (SCOPE_S (LEVEL 3) (LEVEL 38) (LEVEL 5))))
Wed, 23 Jun 2021 11:56:16 - citation_extractor.Utils.IO - [INFO] 3.38.5). => 3.38.5
(CITATION (REF (SCOPE_S (LEVEL 1) (LEVEL 101) (LEVEL 2))))
Wed, 23 Jun 2021 11:56:16 - citation_extractor.Utils.IO - [INFO] 1.101.2. => 1.101.2


If you now inspect the JSON output file again, you will notice that some entities have been enriched with attributes like `urn` and `work_uri`. These attributes indicate that the entity was disambiguated and linked to the corresponding record in the HuCit knowledge base.

In [47]:
cat data/json/bmcr_2013-01-10.json | jq ".|.entities"

{
  "11": {
    "end_offset": 8562,
    "entity_type": "AAUTHOR",
    "id": "11",
    "start_offset": 8553,
    "surface": "Thucydide"
  },
  "10": {
    "end_offset": 8427,
    "entity_type": "AAUTHOR",
    "id": "10",
    "start_offset": 8416,
    "surface": "Thucydides,"
  },
  "13": {
    "entity_type": "REFAUWORK",
    "urn":

# Read extracted citations

Now that the processing is complete, let's see how to compile a list of extracted citations, together with their identifiers.

To do so, it is necessary to read the JSON output file and prepare the data so that it can be stored, for example, in a pandas `DataFrame`.

In [48]:
import os
import codecs
import json
import pandas as pd

def read_json(doc_dir, doc_id):
    
    inp_file_path = os.path.join(doc_dir, doc_id)
    records = []
    
    # read input file
    with codecs.open(inp_file_path, 'r', 'utf-8') as inpfile:
        doc = json.load(inpfile)
        
    # iterate through the extracted relations
    for relation in doc['relations']:
        
        # for each relation, resolve the entity ID
        # to get the corresponding record from the JSON document
        arg1_id, arg2_id = doc['relations'][relation]
        arg1 = doc['entities'][arg1_id]
        arg2 = doc['entities'][arg2_id]
        
        # if the current relation is unlinked (it has a NIL identifier)
        # it will have no scope, so treat it differently
        if arg1['urn'] != 'urn:cts:GreekLatinLit:NIL':
            passage_urn = arg1['urn'] + ":" + arg2['norm_scope']
        else:
            passage_urn = None
        
        # append a dictionary to the list of records
        records.append({
            "docid": doc_id,
            "surface": arg1['surface'] + " " + arg2['surface'],
            "passage_urn": passage_urn,
            "work_urn": arg1['urn'],
            "work_uri": arg1['work_uri'] if 'work_uri' in arg1 else None
        })
        
    return records

In [49]:
# we use a custom function (see cell above) to read the
# output JSON file into a DataFrame

df = pd.DataFrame(
    read_json('data/json/', 'bmcr_2013-01-10.json')
).set_index('docid')

In [50]:
df

| docid | passage_urn | surface | work_uri | work_urn |
|---|---|---|---|---|
| bmcr_2013-01-10.json | urn:cts:greekLit:tlg0016.tlg001:3.38.5 | (Hdt. 3.38.5). | http://purl.org/hucit/kb/works/2691 | urn:cts:greekLit:tlg0016.tlg001 |
| bmcr_2013-01-10.json | urn:cts:greekLit:tlg0003.tlg001:1.101.2 | (Thuc. 1.101.2. | http://purl.org/hucit/kb/works/3998 | urn:cts:greekLit:tlg0003.tlg001 |
| bmcr_2013-01-10.json | urn:cts:greekLit:tlg0032.tlg001:3.3 | Xen. Hell. 3.3) | http://purl.org/hucit/kb/works/4025 | urn:cts:greekLit:tlg0032.tlg001 |
| bmcr_2013-01-10.json | | Dike 12-13, | | urn:cts:GreekLatinLit:NIL |


Now you can compare the above list of extracted citations with the original input document. As you can see, some references were correctly identified while others were missed. 

Interestingly, "Dike 12-13," looked like a canonical citation to the extractor, but ultimately it did not receive a URN.
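To separate linked citations from unlinked ones such as "Dike 12-13,", you can filter on the NIL identifier. The sketch below works on plain dictionaries in the shape returned by `read_json()`, with values copied from the table above, so it runs without pandas.

```python
NIL_URN = "urn:cts:GreekLatinLit:NIL"

# Records in the shape returned by read_json(), copied from the table above
records = [
    {"surface": "(Hdt. 3.38.5).",
     "work_urn": "urn:cts:greekLit:tlg0016.tlg001",
     "passage_urn": "urn:cts:greekLit:tlg0016.tlg001:3.38.5"},
    {"surface": "(Thuc. 1.101.2.",
     "work_urn": "urn:cts:greekLit:tlg0003.tlg001",
     "passage_urn": "urn:cts:greekLit:tlg0003.tlg001:1.101.2"},
    {"surface": "Xen. Hell. 3.3)",
     "work_urn": "urn:cts:greekLit:tlg0032.tlg001",
     "passage_urn": "urn:cts:greekLit:tlg0032.tlg001:3.3"},
    {"surface": "Dike 12-13,",
     "work_urn": NIL_URN,
     "passage_urn": None},
]

# A citation counts as linked when its work URN is not the NIL placeholder
linked = [r for r in records if r["work_urn"] != NIL_URN]
unlinked = [r for r in records if r["work_urn"] == NIL_URN]

print("linked:", [r["passage_urn"] for r in linked])
print("unlinked:", [r["surface"] for r in unlinked])
```

With the `DataFrame` built earlier, the equivalent filter would be a boolean mask on the `work_urn` column.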

In [51]:
!cat data/orig/bmcr_2013-01-10.txt

Bryn Mawr Classical Review
BMCR 2013.01.10 on the BMCR blog

Bryn Mawr Classical Review 2013.01.10
Mélina​ Tamiolaki, Liberté et esclavage chez les historiens grecs classiques. Hellenica​.   Paris:  Presses de l'Université Paris-Sorbonne (PUPS)​, 2010.  Pp. 503.  ISBN 9782840506881.  €28.00 (pb).   

Reviewed by Alberto Maffi, University of Milano-Bicocca (alberto.maffi@unimib.it)
The book is divided into three parts: Liberté et esclavage entre les cités ou les peuples; Liberté et esclavage à l’intérieur des cités ou des peuples; Liberté et esclavage en dehors de la cité. Each of the first two parts is in turn divided into three chapters, dedicated respectively to the analysis of the work of Herodotus, Thucydides and Xenophon. The third part is divided into two chapters, and is devoted to analyzing all of the works of Xenophon: the first chapter, entitled Les limites et les ambiguïtés de la soumission volontaire au chef charismatique, contains an analysis of the Cyropedia, which