# Week 2

## Resources

- https://ipython.readthedocs.io/en/stable/interactive/magics.html
- https://www.opengreekandlatin.org/what-is-a-cts-urn/
- https://cite-architecture.github.io/xcite/ctsurn-quick/

## Catch up and review

### Reading a file into memory

Can you read one of the files from last week into memory? Enter the code to do so below.

In [7]:
# your code for reading a file goes here.
with open('../week-01/austen-pride-and-prejudice.txt') as f:
    austen = f.readlines()

## Git forking, branching, and pushing

In this section, we're going to talk about git forking, branching, and pushing, as this will be the main way that you'll submit homework.

First, you'll want to navigate to the GitHub repository for this course (https://github.com/Tufts-2024-Quant-Text-Analysis/intro-text-analysis) and press the "Fork" button:

![Screenshot of the Fork button](./img/fork.png)

You should then see a menu that looks something like this:

![Screenshot of Fork menu](./img/fork-menu.png)

You can rename the repository if you wish, just make sure to keep track of what you rename it to!

Once you have forked the main repository, go to your fork and click the "Code" button:

![Screenshot of the Code button](./img/code.png)

You can then clone your fork by copying the URL from the dropdown and entering the following in your terminal:

```sh
git clone YOUR_GIT_URL_HERE
```

### Setting up an upstream

By default, your own fork of the repository will be the `origin` for this clone. It is a convention when working with git forks to call the "main" repository `upstream`. You can add `upstream` as a remote by running the following from within your clone:

```sh
git remote add upstream https://github.com/Tufts-2024-Quant-Text-Analysis/intro-text-analysis.git
```

If you know run `git remote -v` from that directory, you should see both `origin` and `upstream`.

**NEVER** push directly to `upstream`. Instead, **`pull`** from `upstream` and **`push`** to your fork.

Whenever you push your work to your fork, you can navigate to it (on the web) and see an option to create a Pull Request:

![Screenshot of pull request](./img/pull-request.png)

Try it now: create a small change (you can just add one of your answers to Week 01), and push it to your fork.

I will create a branch that matches each of your usernames on the main repository. Open your pull request against this branch.

Now we have an easy way of looking at the changes that you've made and comparing them to the main repository without clobbering each other's work.



## Visualizing Data 

### Discuss 

- What kind(s) of visualization would be best for showing the relative frequencies of a verb like καλός in the Platonic corpus versus Thucydides?

Refer to @Brezina2018 [ch. 1] if you feel stuck.

## Installing packages

Inside a Jupyter/Colab notebook (they're functionally the same thing), you can install packages with the magic command `%pip`. See [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html) for more info on iPython/Jupyter Notebook magic commands.

We're going to need the `lxml` package in a moment, so let's go ahead and install it.

In [8]:
%pip install lxml


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Users/pletcher/code/classes/quant-text-analysis/.venv/bin/python -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## CTS URNs

Before we go further, we're going to need to talk a bit about Canonical Text Services Universal Resource Names -- or **CTS URN**s for short.

### Collection

CTS URNs allow us to specify text down to the token level. They work as references to specific components of a larger corpus, starting with the **collection**.

```
urn:cts:greekLit
```

The prefix `urn:cts:` is required by the protocol; `greekLit` refers to the collection of Greek texts known to the CTS implementation.

### Work Component

The next element in a CTS URN is collectively referred to as the **work component**. At a minimum, it contains a reference to a **text group**. 

#### Text Group

Text groups are often what we think of as authors, but by treating them as placeholders not for a specific writer but for canonically related texts, we can stay one step ahead of issues about attribution etc. Text groups are not meant to make any assertions about authorship; they're just a convenient way to find things.

For example, the _Rhesus_ is contained within the Euripides text group by convention; we aren't weighing in on that vexed authorship question.

```
urn:cts:greekLit:tlg0525
```

`tlg0525` refers to Pausanias. You can use https://cts.perseids.org/ to look up URNs, but since we'll be working a lot with Pausanias' _Periegesis_, it might be a good idea to get used to tlg0525.

Why `tlg0525` and not just `pausanias`? Names and their orthography are hard to standardize. CTS URNs are designed to be **universal** and portable. Using names as identifiers too early on would lead to unnecessary confusion.

Should we have settled on a system other than the numbering that the TLG came up with? Probably, but we're decades too late to change that now.


#### Work

Next comes the **work**. This refers to the item -- in our case it will usually be a text -- under the text group.

```
urn:cts:greekLit:tlg0525.tlg001
```

Notice that `tlg001` is separated from `tlg0525` by a `.`, rather than a `:`. This is because only major components of the URN are separated by `:`; minor components, such as the sub-components of the major **work component**, are separated by `.`.

Works within the work component are usually numbered sequentially for the items that we're dealing with. For Sophocles, the sequence starts with _Trachiniae_, so `urn:cts:greekLit:tlg0011.tlg001` refers to that text; `urn:cts:greekLit:tlg0011.tlg003`, for example, refers to _Ajax_.

#### Version

For classical texts, which have any number of editions published over the years, the **version** is essential. It helps us point to a specific edition of the work, complete with that editions editorial interventions.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2
```

The `perseus-grc2` version refers to the second Greek edition of the _Periegesis_ as published by the Perseus Digital Library.

#### Exemplar

There is another element in the work component of CTS URNs, the **exemplar**. You might think of this as a reference to the specific _witness_ that you're dealing with.

We won't need to use this much for retrieving texts, but if you're working on different ways of handling textual material, you might want to append an exemplar fragment to the work component.

For example, if I've made additional annotations to the Perseus _Periegesis_ for my own research that don't necessarily belong in the canonical version (via a pull request vel sim.), I might dub my local exemplar:

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2.charles-annotations
```

That way I know that this URN refers to an exemplar that contains annotations that might not be present in the parent `perseus-grc2`.

I want to emphasize, again, you can get through this course just fine without ever using an exemplar fragment. They're under-specified and confusing, but I mention them here for the sake of completeness.

### Passage Component

Finally, CTS URNs can have a **passage component**. This is the most specific part of the CTS URN, containing references to precise passages and even words within a text.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1
```

The `1` above refers to Pausanias Book 1.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1
```

Now it references Book 1, Chapter 1.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1-2.2
```

Now we're talking about a passage spanning Book 1, Chapter 1, to Book 2, Chapter 2.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάς
```

Now we're referencing the token `Κωλιάς` in Book 1, Chapter 1, Section 5.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάδος-1.1.5@θεαί
```

As a final example, this URN references the span from Κωλιάδος to θεαί in Book 1, Chapter 1, Section 5.

## Getting text to work with

So why this detour on CTS URNs? Because it will make it easier for you to find the texts that you need. I've already added the perseus-grc2 version of Pausanias to this repository; you can find it under `tei/tlg0525.tlg001.perseus-grc2.xml`. (The URN is abbreviated because the file comes from the [PerseusDL/canonical-greekLit](https://github.com/PerseusDL/canonical-greekLit/) repo on GitHub.)

Ideally, we would be able to request these texts from an API, but as of this writing in August 2024, all of the known APIs are not working. So for now, we will parse these files locally and transform them into data structures that facilitate our analyses.

We'll first need to install the MyCapytains library.

In [1]:
%pip install MyCapytain

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


Then we can import this module and use it to ingest the text of Pausanias, stored in the `tei/` directory of this repo.

In [2]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-grc2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-grc2", resource=f)

Let's try turning the text into just a [Pandas DataFrame](https://pandas.pydata.org/docs/index.html) with columns for the CTS URN, the corresponding XML, and the unannotated text of the passage.

In [3]:
# this block might take a while

from lxml import etree
from MyCapytain.common.constants import Mimetypes

urns = []
raw_xmls = []
unannotated_strings = []

for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

In [4]:
import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

## Working with textual data

Now that we have some text to work with -- and by "some text," I mean all 3170 sections of Pausanias in the above DataFrame -- we can start working with the data.

Before doing so, however, we should ask _how_ we're going to make the data more manageable -- it isn't exactly feasible to dive headfirst into a corpus of this size.

> Discuss: What units can we break Pausanias down into to make it more manageable? Don't worry about how you would do it in code yet, just think about how you might explore the units of the text.


## Types of words

With all languages, but especially with heavily-inflected languages like ancient Greek and Latin, it is important to be precise about the kinds of word forms that we're dealing with.

#### Tokens 

A **token** or **running word** "is a single occurrence of a word form in the text" [@Brezina2018 39].

How can we count the number of tokens in all of Pausanias? First we need to **tokenize** the `unannotated_strings` column of `pausanias_df`.

Tokenization is a surprisingly complicated process depending on the language of study, and we will learn more sophisticated methods for tokenizing Greek text as we go along.

For now, however, let's define a token as "whitespace-delimited text" -- we're not going to worry about punctuation etc. just yet.

So to tokenize the `unannotated_strings` column of `pausanias_df`, we can run:

In [13]:
# See https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html for
# panda's string-splitting utilities; it splits on whitespace by default
pausanias_df['whitespaced_tokens'] = pausanias_df['unannotated_strings'].str.split()

pausanias_df

Unnamed: 0,urn,raw_xml,unannotated_strings,whitespaced_tokens
0,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",τῆς ἠπείρου τῆς Ἑλληνικῆς κατὰ νήσους τὰς Κυκλ...,"[τῆς, ἠπείρου, τῆς, Ἑλληνικῆς, κατὰ, νήσους, τ..."
1,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","ὁ δὲ Πειραιεὺς δῆμος μὲν ἦν ἐκ παλαιοῦ, πρότερ...","[ὁ, δὲ, Πειραιεὺς, δῆμος, μὲν, ἦν, ἐκ, παλαιοῦ..."
2,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",θέας δὲ ἄξιον τῶν ἐν Πειραιεῖ μάλιστα Ἀθηνᾶς ἐ...,"[θέας, δὲ, ἄξιον, τῶν, ἐν, Πειραιεῖ, μάλιστα, ..."
3,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ἔστι δὲ καὶ ἄλλος Ἀθηναίοις ὁ μὲν ἐπὶ Μουνυχίᾳ...,"[ἔστι, δὲ, καὶ, ἄλλος, Ἀθηναίοις, ὁ, μὲν, ἐπὶ,..."
4,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ἀπέχει δὲ σταδίους εἴκοσιν ἄκρα Κωλιάς· ἐς ταύ...,"[ἀπέχει, δὲ, σταδίους, εἴκοσιν, ἄκρα, Κωλιάς·,..."
...,...,...,...,...
3165,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",οὗτοι μὲν δὴ ὑπεροικοῦσιν Ἀμφίσσης· ἐπὶ θαλάσσ...,"[οὗτοι, μὲν, δὴ, ὑπεροικοῦσιν, Ἀμφίσσης·, ἐπὶ,..."
3166,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",κληθῆναι δὲ ἀπὸ γυναικὸς ἢ νύμφης τεκμαίρομαι ...,"[κληθῆναι, δὲ, ἀπὸ, γυναικὸς, ἢ, νύμφης, τεκμα..."
3167,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",τὰ δὲ ἔπη τὰ Ναυπάκτια ὀνομαζόμενα ὑπὸ Ἑλλήνων...,"[τὰ, δὲ, ἔπη, τὰ, Ναυπάκτια, ὀνομαζόμενα, ὑπὸ,..."
3168,urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",ἐνταῦθα ἔστι μὲν ἐπὶ θαλάσσῃ ναὸς Ποσειδῶνος κ...,"[ἐνταῦθα, ἔστι, μὲν, ἐπὶ, θαλάσσῃ, ναὸς, Ποσει..."


Okay, now that we have some arrays of tokens in `whitespaced_tokens` column, how do we count them?

In [14]:
sum(len(ts) for ts in pausanias_df['whitespaced_tokens'])

217416

> Discuss: What does the above line of code do?

The above line of code is not very idiomatic for Pandas, however. Instead, we should write something like the following:

In [15]:
pausanias_df['whitespaced_tokens'].explode().count()

217416

### Types

A **type** is a unique word form in the corpus. For example, the inflected forms βουλεύεται and βουλεύομεν are each a type. (See @Brezina2018 [39-40].)

In other words, **types** are **tokens** grouped by form. So to count the number of **types** in Pausanias, we can do the following:

In [16]:
len(pausanias_df['whitespaced_tokens'].explode().unique())

41363

> Discuss: Break the above line of code down method by method.

What if we want to see the top `n` types in the corpus?

In [17]:
from collections import Counter

type_counts = Counter(pausanias_df['whitespaced_tokens'].explode())

type_counts.most_common(100)

[('καὶ', 11810),
 ('δὲ', 10616),
 ('ἐς', 4162),
 ('τοῦ', 3392),
 ('τὸ', 3262),
 ('ἐν', 3114),
 ('τὴν', 2940),
 ('τε', 2791),
 ('μὲν', 2791),
 ('τῶν', 2375),
 ('τὸν', 2316),
 ('τῆς', 2299),
 ('ὁ', 1978),
 ('οἱ', 1976),
 ('τὰ', 1931),
 ('ἐπὶ', 1908),
 ('τῷ', 1877),
 ('τῇ', 1575),
 ('ὡς', 1117),
 ('τοῖς', 1061),
 ('τοὺς', 1032),
 ('ἐκ', 1018),
 ('ὑπὸ', 1000),
 ('δὴ', 975),
 ('ἐστιν', 962),
 ('ἡ', 942),
 ('οὐ', 916),
 ('γὰρ', 874),
 ('κατὰ', 802),
 ('πρὸς', 779),
 ('ἀπὸ', 776),
 ('εἶναι', 641),
 ('παρὰ', 633),
 ('οὐκ', 631),
 ('τὰς', 522),
 ('ἔτι', 520),
 ('δέ', 494),
 ('ἢ', 472),
 ('ἐξ', 455),
 ('ἐστὶν', 454),
 ('αὐτὸν', 446),
 ('αὐτῷ', 444),
 ('ἐνταῦθα', 442),
 ('ἦν', 441),
 ('ἔστι', 402),
 ('μάλιστα', 393),
 ('τοῦτο', 388),
 ('ἱερὸν', 363),
 ('ἄγαλμα', 361),
 ('ἐστι', 358),
 ('μετὰ', 358),
 ('σφισιν', 328),
 ('ὕστερον', 328),
 ('αὐτῶν', 328),
 ('περὶ', 319),
 ('οὖν', 319),
 ('ταῖς', 305),
 ('ὄνομα', 295),
 ('γε', 275),
 ('γενέσθαι', 275),
 ('τότε', 274),
 ('αἱ', 274),
 ('ὅτι', 270),
 ('

### Stop words

Hm, that's not particularly interesting -- most of these words are fairly common and will rank highly in almost any corpus. Further, since we haven't accounted for punctuation, we're probably generating frequencies incorrectly based on whether or not a type is joined to any punctuation. We need to get a bit more sophisticated.

Let's install `spacy` and `grecy` to perform better tokenization and incorporate the notion of a **stop word**: a token that is so common that including it in most statistical analyses will just generate noise.

In [5]:
## Uncomment the line for your system's architecture
%pip install spacy
# %pip install 'spacy[apple]'
%pip install grecy

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


Now we need to install a model for `grecy` to use. Note that there is a known but so-far unpatched issue where this command will only work with Python 3.11.9 and pip 24.0 (or a bit older in either case).

In [6]:
%run -m grecy install grc_proiel_sm


Installing grc_proiel_sm.....

Please wait, this could take some minutes.....

Defaulting to user installation because normal site-packages is not writeable
Collecting grc-proiel-sm==any
Downloading https://huggingface.co/Jacobo/grc_proiel_sm/resolve/main/grc_proiel_sm-any-py3-none-any.whl (65.5 MB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/65.5 MB[0m [31m?[0m eta [36m-:--:--[0m
[2K     [91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.8/65.5 MB[0m [31m27.4 MB/s[0m eta [36m0:00:03[0m
[2K     [91m━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/65.5 MB[0m [31m34.3 MB/s[0m eta [36m0:00:02[0m
[2K     [91m━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/65.5 MB[0m [31m35.0 MB/s[0m eta [36m0:00:02[0m
[2K     [91m━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.1/65.5 MB[0m [31m36.8 MB/s[0m eta [36m0:00:02[0m
[2K     [91m━━━━[0m[90m╺[0m[90m━━━━

In [7]:
import spacy

nlp = spacy.load("grc_proiel_trf", disable=["ner"])

ValueError: 'in' is not a valid parameter name

In [8]:
tokenizer = nlp.tokenizer

pausanias_df['tokens'] = pausanias_df['unannotated_strings'].apply(tokenizer)

NameError: name 'nlp' is not defined

The tokenization process through SpaCy adds some features to each of the tokens in the `tokens` column. Now we can collect the types and exclude stop words using the `token.is_stop` attribute.

In [40]:
types = [t.text for t in pausanias_df['tokens'].explode() if not t.is_stop and t.is_alpha]

type_counts = Counter(types)

type_counts.most_common(100)

[('ἐνταῦθα', 491),
 ('μάλιστα', 421),
 ('ἄγαλμα', 411),
 ('ἱερὸν', 369),
 ('ὕστερον', 351),
 ('σφισιν', 347),
 ('ὄνομα', 345),
 ('γενέσθαι', 329),
 ('λέγουσιν', 328),
 ('τότε', 287),
 ('ἐπʼ', 268),
 ('πόλιν', 258),
 ('ἤδη', 257),
 ('σφᾶς', 250),
 ('φασιν', 238),
 ('δʼ', 234),
 ('πρότερον', 227),
 ('σφίσιν', 216),
 ('παῖδα', 215),
 ('ἐγένετο', 203),
 ('πεποίηται', 199),
 ('ὕδωρ', 196),
 ('λέγουσι', 194),
 ('Λακεδαιμονίων', 190),
 ('ἐφʼ', 186),
 ('ἔχει', 184),
 ('αὖθις', 183),
 ('ἱερόν', 183),
 ('λίθου', 183),
 ('Ἀθηνᾶς', 178),
 ('θεῶν', 176),
 ('Ἀθηναίων', 168),
 ('λέγεται', 167),
 ('Ἀπόλλωνος', 166),
 ('ἀγάλματα', 164),
 ('Ἑλλήνων', 164),
 ('σφισι', 162),
 ('δύο', 162),
 ('πρὸ', 160),
 ('φασὶν', 159),
 ('ἔνθα', 158),
 ('ὁμοῦ', 154),
 ('ἅτε', 154),
 ('ἀρχῆς', 153),
 ('Διὸς', 151),
 ('Ἀθηναίοις', 149),
 ('Ἀχαιῶν', 146),
 ('πολὺ', 142),
 ('Ἀρτέμιδος', 139),
 ('ἔργον', 138),
 ('ἀρχαῖον', 138),
 ('ἐπίκλησιν', 135),
 ('ἀπʼ', 134),
 ('γῆν', 134),
 ('Λακεδαιμόνιοι', 134),
 ('παίδων', 132),
 ('

Much better! We now have a list of the most common types, exluding stop words and punctuation.

Be careful, though: these are still just raw counts, and they tell us very little about how we might characterize Pausanias vis-à-vis a larger corpus.

## Lemmata/lemmas

A **lemma** (plural **lemmata** or **lemmas**) represents "a group of all inflectional forms related to one stem that belong to the same word class (Kučera & Francis 1967: 1)" [@Brezina2018 40]. In simpler terms, a **lemma** is the dictionary form of a word, so **lemmata** give us a way of further reducing the word count. ἐστίν, ἔσμεν, and εἰσίν all have the same **lemma**: εἰμί.

Lemmatization, as you might guess, often involves additional processing. Luckily, we can use the SpaCy and GreCy models again.

In [41]:
raw_texts = [t for t in pausanias_df['unannotated_strings']]
annotated_texts = nlp.pipe(raw_texts, batch_size=100)

pausanias_df['nlp_docs'] = list(annotated_texts)

  with torch.cuda.amp.autocast(self._mixed_precision):


In [53]:
lemmata = [t.lemma_ for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lemmata_counts = Counter(lemmata)

lemmata_counts.most_common(100)

[('ποιέω', 1201),
 ('λέγω', 1185),
 ('ἔχω', 1152),
 ('σφεῖς', 1093),
 ('γίγνομαι', 1090),
 ('παῖς', 827),
 ('καλέω', 797),
 ('πόλις', 778),
 ('φημί', 765),
 ('πολύς', 742),
 ('ἄγαλμα', 697),
 ('ἀνήρ', 678),
 ('ἱερόν', 657),
 ('θεός', 644),
 ('Λακεδαιμόνιος', 574),
 ('ἐνταῦθα', 525),
 ('λόγος', 482),
 ('πᾶς', 473),
 ('ἐπί', 454),
 ('μάλα', 449),
 ('ναός', 445),
 ('ὄνομα', 434),
 ('γυνή', 393),
 ('Ἀθηναῖος', 383),
 ('ὅσος', 380),
 ('ὕστερος', 376),
 ('γῆ', 371),
 ('Μεσσήνιος', 370),
 ('μέγας', 369),
 ('Ἀπόλλων', 359),
 ('Ἕλλην', 358),
 ('ὀνομάζω', 351),
 ('πρῶτος', 350),
 ('ἀρχή', 345),
 ('ἀφικνέομαι', 337),
 ('Ζεύς', 327),
 ('στάδιον', 327),
 ('Ἠλεῖος', 321),
 ('ποταμός', 320),
 ('ἄγω', 319),
 ('ἄνθρωπος', 316),
 ('ἔργον', 311),
 ('ἀρχαῖος', 308),
 ('πρότερος', 306),
 ('πόλεμος', 303),
 ('Ἀχαιός', 298),
 ('τότε', 291),
 ('θάλασσα', 288),
 ('Ἀργεῖος', 285),
 ('μάχη', 281),
 ('ὄρος', 273),
 ('Ἀρκάς', 272),
 ('λίθος', 272),
 ('ὕδωρ', 267),
 ('ἕτερος', 265),
 ('Ἄρτεμις', 262),
 ('χώρα', 257

#### Lexemes

Finally, "a **lexeme** is a lemma with a particular meaning attached to it.... The best way of conceptualizing a lexeme is as a subentry in a dictionary" [@Brezina2018 40].

One challenge of working with lexemes is that, even with the advances of large language models like ChatGPT, there is no surefire way to annotate them automatically. We still need "human-in-the-loop" pipelines to catch errors and ambiguities. And keep in mind that even two humans might disagree on the lexeme for a particular word!

But we can inspect the `lex` attributes of the tokens that SpaCy has generated for us and see if they make sense.

In [55]:
lexemes = [(t.text, t.lex) for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lexeme_counts = Counter(lexemes)

lexeme_counts.most_common(100)

[(('ἐνταῦθα', <spacy.lexeme.Lexeme at 0x39f333c80>), 491),
 (('μάλιστα', <spacy.lexeme.Lexeme at 0x39f333980>), 421),
 (('ἄγαλμα', <spacy.lexeme.Lexeme at 0x392a7dbc0>), 411),
 (('ἱερὸν', <spacy.lexeme.Lexeme at 0x392abfe40>), 369),
 (('ὕστερον', <spacy.lexeme.Lexeme at 0x392a7f840>), 351),
 (('σφισιν', <spacy.lexeme.Lexeme at 0x39f332b00>), 347),
 (('ὄνομα', <spacy.lexeme.Lexeme at 0x392a89180>), 345),
 (('γενέσθαι', <spacy.lexeme.Lexeme at 0x392a88300>), 329),
 (('λέγουσιν', <spacy.lexeme.Lexeme at 0x392a7e240>), 328),
 (('τότε', <spacy.lexeme.Lexeme at 0x392a80980>), 287),
 (('ἐπʼ', <spacy.lexeme.Lexeme at 0x392a851c0>), 268),
 (('πόλιν', <spacy.lexeme.Lexeme at 0x392a7e3c0>), 258),
 (('ἤδη', <spacy.lexeme.Lexeme at 0x392aa1b00>), 257),
 (('σφᾶς', <spacy.lexeme.Lexeme at 0x392a92680>), 250),
 (('φασιν', <spacy.lexeme.Lexeme at 0x392a7d500>), 238),
 (('δʼ', <spacy.lexeme.Lexeme at 0x392ac2a80>), 234),
 (('πρότερον', <spacy.lexeme.Lexeme at 0x39f331d80>), 227),
 (('σφίσιν', <spacy.lex

Wait a second -- this list looks identical to our list of word types.

Sure enough, when we check the SpaCy documentation for [Lexeme](https://spacy.io/api/lexeme):

> A Lexeme has no string context – it’s a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).


## Review

> Discuss: Define **token**, **type**, **stop word**, **lemma**, and **lexeme** in your own words.

> Discuss: How can we use these different notions of "word" in our analysis of corpora? Why is it important to be precise about what kind of word(s) we're using?

## Homework

1. Read @Brezina2018 [ch. 2, pp. 41--65].
2. Choose 3 books of Pausanias and calculate the most common tokens, types, and lemmata for each. In a paragraph or so, describe your findings relative to the work we have done in class today.
3. Using your findings from 2., write a short (1-page) evaluation of one of the books of Pausanias that you have analyzed. Does your qualitative -- which is not to say "subjective" -- experience of reading the text cohere with your quantitative evaluation?