# Week 2

## Resources

- https://ipython.readthedocs.io/en/stable/interactive/magics.html
- https://www.opengreekandlatin.org/what-is-a-cts-urn/
- https://cite-architecture.github.io/xcite/ctsurn-quick/

## Catch up and review

### Reading a file into memory

Can you read one of the files from last week into memory in RStudio? Enter the code to do so below.

In [None]:
# your code for reading a file goes here.

## Visualizing Data 

### Discuss 

- What kind(s) of visualization would be best for showing the relative frequencies of an adjective like καλός in the Platonic corpus versus Thucydides?

Refer to @Brezina2018 [ch. 1] if you feel stuck.

## Installing packages

Inside a Jupyter/Colab notebook (they're functionally the same thing), you can install packages with the magic command `%pip`. See [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html) for more info on IPython/Jupyter Notebook magic commands.

We're going to need the `lxml` package in a moment, so let's go ahead and install it.

In [5]:
%pip install lxml


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Users/pletcher/code/classes/quant-text-analysis/.venv/bin/python -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## CTS URNs

Before we go further, we're going to need to talk a bit about Canonical Text Services Uniform Resource Names -- or **CTS URN**s for short.

### Collection

CTS URNs allow us to specify text down to the token level. They work as references to specific components of a larger corpus, starting with the **collection**.

```
urn:cts:greekLit
```

The prefix `urn:cts:` is required by the protocol; `greekLit` refers to the collection of Greek texts known to the CTS implementation.

### Work Component

The next element in a CTS URN is collectively referred to as the **work component**. At a minimum, it contains a reference to a **text group**. 

#### Text Group

Text groups are often what we think of as authors, but by treating them as placeholders not for a specific writer but for canonically related texts, we can stay one step ahead of issues about attribution etc. Text groups are not meant to make any assertions about authorship; they're just a convenient way to find things.

For example, the _Rhesus_ is contained within the Euripides text group by convention; we aren't weighing in on that vexed authorship question.

```
urn:cts:greekLit:tlg0525
```

`tlg0525` refers to Pausanias. You can use https://cts.perseids.org/ to look up URNs, but since we'll be working a lot with Pausanias' _Periegesis_, it might be a good idea to get used to `tlg0525`.

Why `tlg0525` and not just `pausanias`? Names and their orthography are hard to standardize. CTS URNs are designed to be **universal** and portable. Using names as identifiers too early on would lead to unnecessary confusion.

Should we have settled on a system other than the numbering that the TLG came up with? Probably, but we're decades too late to change that now.


#### Work

Next comes the **work**. This refers to the item -- in our case it will usually be a text -- under the text group.

```
urn:cts:greekLit:tlg0525.tlg001
```

Notice that `tlg001` is separated from `tlg0525` by a `.`, rather than a `:`. This is because only major components of the URN are separated by `:`; minor components, such as the sub-components of the major **work component**, are separated by `.`.

Works within the work component are usually numbered sequentially for the items that we're dealing with. For Sophocles, the sequence starts with _Trachiniae_, so `urn:cts:greekLit:tlg0011.tlg001` refers to that text; `urn:cts:greekLit:tlg0011.tlg003`, for example, refers to _Ajax_.

#### Version

For classical texts, which have any number of editions published over the years, the **version** is essential. It helps us point to a specific edition of the work, complete with that edition's editorial interventions.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2
```

The `perseus-grc2` version refers to the second Greek edition of the _Periegesis_ as published by the Perseus Digital Library.

#### Exemplar

There is another element in the work component of CTS URNs, the **exemplar**. You might think of this as a reference to the specific _witness_ that you're dealing with.

We won't need to use this much for retrieving texts, but if you're working on different ways of handling textual material, you might want to append an exemplar fragment to the work component.

For example, if I've made additional annotations to the Perseus _Periegesis_ for my own research that don't necessarily belong in the canonical version (via a pull request vel sim.), I might dub my local exemplar:

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2.charles-annotations
```

That way I know that this URN refers to an exemplar that contains annotations that might not be present in the parent `perseus-grc2`.

I want to emphasize, again, you can get through this course just fine without ever using an exemplar fragment. They're under-specified and confusing, but I mention them here for the sake of completeness.
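The separator rules above (`:` between major components, `.` within the work component) can be sketched in a few lines of plain Python. This is only an illustration, not part of any CTS library: the names `CtsWork` and `parse_work` are made up for this example, and the function assumes a well-formed URN with at most four dot-separated fields in the work component.

```python
from typing import NamedTuple, Optional


class CtsWork(NamedTuple):
    """Collection plus the pieces of the work component (hypothetical helper type)."""
    collection: str
    text_group: str
    work: Optional[str] = None
    version: Optional[str] = None
    exemplar: Optional[str] = None


def parse_work(urn: str) -> CtsWork:
    """Split a CTS URN on ':' and then split its work component on '.'."""
    parts = urn.split(":")
    # Expect at least: "urn", "cts", collection, work component.
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts" or not parts[3]:
        raise ValueError(f"not a CTS URN with a work component: {urn!r}")
    fields = parts[3].split(".")
    if len(fields) > 4:
        raise ValueError(f"too many fields in work component: {parts[3]!r}")
    # Missing trailing fields (work, version, exemplar) stay None.
    return CtsWork(parts[2], *fields)


parsed = parse_work("urn:cts:greekLit:tlg0525.tlg001.perseus-grc2")
print(parsed.text_group, parsed.work, parsed.version, parsed.exemplar)
```

Any passage component after the fourth `:` is simply ignored by this sketch.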

### Passage Component

Finally, CTS URNs can have a **passage component**. This is the most specific part of the CTS URN, containing references to precise passages and even words within a text.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1
```

The `1` above refers to Pausanias Book 1.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1
```

Now it references Book 1, Chapter 1.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1-2.2
```

Now we're talking about a passage spanning Book 1, Chapter 1, to Book 2, Chapter 2.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάς
```

Now we're referencing the token `Κωλιάς` in Book 1, Chapter 1, Section 5.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάδος-1.1.5@θεαί
```

As a final example, this URN references the span from Κωλιάδος to θεαί in Book 1, Chapter 1, Section 5.
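To make the pieces of the passage component concrete, here is a rough sketch that pulls apart the examples above using only string methods. The names `PassagePoint`, `parse_point`, and `parse_passage` are hypothetical, and the code assumes that tokens never contain `-`, `:`, or `@`; a real implementation would need to be more careful.

```python
from typing import NamedTuple, Optional, Tuple


class PassagePoint(NamedTuple):
    """One end of a passage reference (hypothetical helper type)."""
    citation: Tuple[str, ...]  # e.g. ("1", "1", "5") for Book 1, Chapter 1, Section 5
    token: Optional[str]       # the part after "@", if any


def parse_point(point: str) -> PassagePoint:
    """Split '1.1.5@Κωλιάς' into its citation levels and optional token."""
    ref, _, token = point.partition("@")
    return PassagePoint(tuple(ref.split(".")), token or None)


def parse_passage(urn: str) -> Tuple[PassagePoint, Optional[PassagePoint]]:
    """Return (start, end) for the passage component; end is None if there is no range."""
    parts = urn.split(":")
    if len(parts) != 5:
        raise ValueError(f"no passage component in {urn!r}")
    # Only the passage component is searched for "-", so the hyphen
    # in a version such as "perseus-grc2" does not interfere.
    start, dash, end = parts[4].partition("-")
    return parse_point(start), (parse_point(end) if dash else None)


start, end = parse_passage(
    "urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάδος-1.1.5@θεαί"
)
print(start.citation, start.token)
print(end.citation, end.token)
```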

## Getting text to work with

So why this detour on CTS URNs? Because it will make it easier for you to find the texts that you need. I've already added the perseus-grc2 version of Pausanias to this repository; you can find it under `tei/tlg0525.tlg001.perseus-grc2.xml`. (The URN is abbreviated because the file comes from the [PerseusDL/canonical-greekLit](https://github.com/PerseusDL/canonical-greekLit/) repo on GitHub.)
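As a small illustration of how the URN maps onto the file name, the following sketch derives the path from a URN by keeping only the work component. The function name `tei_path_for` is made up for this example, and it assumes that every file in `tei/` follows the same naming pattern as the Pausanias file.

```python
from pathlib import PurePosixPath


def tei_path_for(urn: str, tei_dir: str = "tei") -> PurePosixPath:
    """Map a CTS URN to the assumed location of its TEI file."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN with a work component: {urn!r}")
    # parts[3] is the work component, e.g. "tlg0525.tlg001.perseus-grc2";
    # the "urn:cts:greekLit:" prefix and any passage component are dropped.
    return PurePosixPath(tei_dir) / f"{parts[3]}.xml"


print(tei_path_for("urn:cts:greekLit:tlg0525.tlg001.perseus-grc2"))
```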

Ideally, we would be able to request these texts from an API, but as of this writing in August 2024, none of the known APIs are working. So for now, we will parse these files locally and transform them into data structures that facilitate our analyses.

We'll first need to install the MyCapytain library.

In [1]:
%pip install MyCapytain

Note: you may need to restart the kernel to use updated packages.


Then we can import this module and use it to ingest the text of Pausanias, stored in the `tei/` directory of this repo.

In [12]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-grc2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-grc2", resource=f)

Let's try turning the text into a [Pandas DataFrame](https://pandas.pydata.org/docs/index.html) with columns for the CTS URN, the corresponding XML, and the unannotated text of the passage.

In [13]:
# this block might take a while

from lxml import etree
from MyCapytain.common.constants import Mimetypes

df_array = []

for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    df_array.append({'urn': urn, 'xml': raw_xml, 'unannotated': s})

In [14]:
import pandas as pd

pausanias_df = pd.DataFrame(df_array)

|  | urn | xml | unannotated |
| --- | --- | --- | --- |
| 0 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | τῆς ἠπείρου τῆς Ἑλληνικῆς κατὰ νήσους τὰς Κυκλ... |
| 1 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | ὁ δὲ Πειραιεὺς δῆμος μὲν ἦν ἐκ παλαιοῦ, πρότερ... |
| 2 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | θέας δὲ ἄξιον τῶν ἐν Πειραιεῖ μάλιστα Ἀθηνᾶς ἐ... |
| 3 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | ἔστι δὲ καὶ ἄλλος Ἀθηναίοις ὁ μὲν ἐπὶ Μουνυχίᾳ... |
| 4 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | ἀπέχει δὲ σταδίους εἴκοσιν ἄκρα Κωλιάς· ἐς ταύ... |
| ... | ... | ... | ... |
| 3165 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | οὗτοι μὲν δὴ ὑπεροικοῦσιν Ἀμφίσσης· ἐπὶ θαλάσσ... |
| 3166 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | κληθῆναι δὲ ἀπὸ γυναικὸς ἢ νύμφης τεκμαίρομαι ... |
| 3167 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | τὰ δὲ ἔπη τὰ Ναυπάκτια ὀνομαζόμενα ὑπὸ Ἑλλήνων... |
| 3168 | urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | ἐνταῦθα ἔστι μὲν ἐπὶ θαλάσσῃ ναὸς Ποσειδῶνος κ... |
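Because each row carries its full CTS URN, the passage component can be split into separate columns, which makes it easy to select, say, everything in Book 1. The sketch below uses a three-row stand-in called `toy_df` (its URNs are illustrative, not copied from the output) so that it runs on its own; it assumes every passage reference has exactly three numeric levels (book, chapter, section). The same two lines should apply to the `urn` column of `pausanias_df` if that assumption holds.

```python
import pandas as pd

# A tiny stand-in for pausanias_df with only the "urn" column.
toy_df = pd.DataFrame({
    "urn": [
        "urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.1",
        "urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.2",
        "urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:2.1.1",
    ]
})

# Everything after the last ":" is the passage component, e.g. "1.1.2".
passages = toy_df["urn"].str.rsplit(":", n=1).str[-1]

# Split "book.chapter.section" into three integer columns.
toy_df[["book", "chapter", "section"]] = (
    passages.str.split(".", expand=True).astype(int)
)

# Select only the rows that belong to Book 1.
book_1 = toy_df[toy_df["book"] == 1]
print(book_1[["book", "chapter", "section"]])
```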
