# Week 2

## Resources

- https://ipython.readthedocs.io/en/stable/interactive/magics.html
- https://www.opengreekandlatin.org/what-is-a-cts-urn/
- https://cite-architecture.github.io/xcite/ctsurn-quick/

## Catch up and review

### Reading a file into memory

Can you read one of the files from last week into memory? Enter the code to do so below.

In [1]:
# your code for reading a file goes here.

with open('../week-01/austen-pride-and-prejudice.txt', encoding='utf-8') as f:
    # read() returns the whole file as a single string
    print(f.read())

It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.

However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.
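
As a minimal sketch of the two most common ways to read a file: `read()` returns the whole file as one string, while `readlines()` returns a list with one string per line. The temporary file and its contents are stand-ins so that the example runs anywhere; substitute the path to your own text.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for one of last week's files
sample = "It is a truth universally acknowledged,\nthat a single man...\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(sample, encoding="utf-8")

    # read() gives the whole file as a single string
    with open(path, encoding="utf-8") as f:
        whole_text = f.read()

    # readlines() gives a list of lines, each still ending in "\n"
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

print(type(whole_text).__name__, len(whole_text))
print(type(lines).__name__, len(lines))
```

Which one you want depends on the next step: a single string is convenient for tokenizing the whole text, a list of lines for working line by line.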



## Git forking, branching, and pushing

In this section, we're going to talk about git forking, branching, and pushing, as this will be the main way that you'll submit homework.

First, you'll want to navigate to the GitHub repository for this course (https://github.com/Tufts-2024-Quant-Text-Analysis/intro-text-analysis) and press the "Fork" button:

![Screenshot of the Fork button](./img/fork.png)

You should then see a menu that looks something like this:

![Screenshot of Fork menu](./img/fork-menu.png)

You can rename the repository if you wish; just make sure to keep track of the new name!

Once you have forked the main repository, go to your fork and click the "Code" button:

![Screenshot of the Code button](./img/code.png)

You can then clone your fork by copying the URL from the dropdown and entering the following in your terminal:

```sh
git clone YOUR_GIT_URL_HERE
```

### Setting up an upstream

By default, your own fork of the repository will be the `origin` for this clone. It is a convention when working with git forks to call the "main" repository `upstream`. You can add `upstream` as a remote by running the following from within your clone:

```sh
git remote add upstream https://github.com/Tufts-2024-Quant-Text-Analysis/intro-text-analysis.git
```

If you now run `git remote -v` from that directory, you should see both `origin` and `upstream`.

**NEVER** push directly to `upstream`. Instead, **`pull`** from `upstream` and **`push`** to your fork.

Whenever you push your work to your fork, you can navigate to it (on the web) and see an option to create a Pull Request:

![Screenshot of pull request](./img/pull-request.png)

Try it now: create a small change (you can just add one of your answers to Week 01), and push it to your fork.

I will create a branch that matches each of your usernames on the main repository. Open your pull request against this branch.

Now we have an easy way of looking at the changes that you've made and comparing them to the main repository without clobbering each other's work.



## Visualizing Data 

### Discuss 

- What kind(s) of visualization would be best for showing the relative frequencies of an adjective like καλός in the Platonic corpus versus Thucydides?

Refer to @Brezina2018 [ch. 1] if you feel stuck.

## Installing packages

Inside a Jupyter/Colab notebook (they're functionally the same thing), you can install packages with the magic command `%pip`. See [here](https://ipython.readthedocs.io/en/stable/interactive/magics.html) for more info on IPython/Jupyter Notebook magic commands.

We're going to need the `lxml` package in a moment, so let's go ahead and install it.

In [None]:
%pip install lxml

## CTS URNs

Before we go further, we're going to need to talk a bit about Canonical Text Services Uniform Resource Names -- or **CTS URN**s for short.

### Collection

CTS URNs allow us to specify text down to the token level. They work as references to specific components of a larger corpus, starting with the **collection**.

```
urn:cts:greekLit
```

The prefix `urn:cts:` is required by the protocol; `greekLit` refers to the collection of Greek texts known to the CTS implementation.

### Work Component

The next element in a CTS URN is collectively referred to as the **work component**. At a minimum, it contains a reference to a **text group**. 

#### Text Group

Text groups are often what we think of as authors, but by treating them as placeholders not for a specific writer but for canonically related texts, we can stay one step ahead of issues about attribution etc. Text groups are not meant to make any assertions about authorship; they're just a convenient way to find things.

For example, the _Rhesus_ is contained within the Euripides text group by convention; we aren't weighing in on that vexed authorship question.

```
urn:cts:greekLit:tlg0525
```

`tlg0525` refers to Pausanias. You can use https://cts.perseids.org/ to look up URNs, but since we'll be working a lot with Pausanias' _Periegesis_, it might be a good idea to get used to tlg0525.

Why `tlg0525` and not just `pausanias`? Names and their orthography are hard to standardize. CTS URNs are designed to be **universal** and portable. Using names as identifiers too early on would lead to unnecessary confusion.

Should we have settled on a system other than the numbering that the TLG came up with? Probably, but we're decades too late to change that now.


#### Work

Next comes the **work**. This refers to the item -- in our case it will usually be a text -- under the text group.

```
urn:cts:greekLit:tlg0525.tlg001
```

Notice that `tlg001` is separated from `tlg0525` by a `.`, rather than a `:`. This is because only major components of the URN are separated by `:`; minor components, such as the sub-components of the major **work component**, are separated by `.`.

Works within the work component are usually numbered sequentially for the items that we're dealing with. For Sophocles, the sequence starts with _Trachiniae_, so `urn:cts:greekLit:tlg0011.tlg001` refers to that text; `urn:cts:greekLit:tlg0011.tlg003`, for example, refers to _Ajax_.

#### Version

For classical texts, which have any number of editions published over the years, the **version** is essential. It helps us point to a specific edition of the work, complete with that edition's editorial interventions.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2
```

The `perseus-grc2` version refers to the second Greek edition of the _Periegesis_ as published by the Perseus Digital Library.

#### Exemplar

There is another element in the work component of CTS URNs, the **exemplar**. You might think of this as a reference to the specific _witness_ that you're dealing with.

We won't need to use this much for retrieving texts, but if you're working on different ways of handling textual material, you might want to append an exemplar fragment to the work component.

For example, if I've made additional annotations to the Perseus _Periegesis_ for my own research that don't necessarily belong in the canonical version (via a pull request vel sim.), I might dub my local exemplar:

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2.charles-annotations
```

That way I know that this URN refers to an exemplar that contains annotations that might not be present in the parent `perseus-grc2`.

I want to emphasize, again, you can get through this course just fine without ever using an exemplar fragment. They're under-specified and confusing, but I mention them here for the sake of completeness.
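
To make the `:` versus `.` distinction concrete, here is a minimal sketch that splits a URN into its collection and the parts of its work component. The function name `split_work_component` and the dictionary keys are my own labels for illustration, not part of any CTS library.

```python
def split_work_component(urn: str) -> dict:
    """Split a CTS URN into its collection and work-component parts."""
    # Major components are separated by ":"
    parts = urn.split(":")
    if parts[:2] != ["urn", "cts"] or len(parts) < 4:
        raise ValueError(f"not a CTS URN with a work component: {urn!r}")

    collection, work_component = parts[2], parts[3]

    # Minor components inside the work component are separated by "."
    labels = ["text_group", "work", "version", "exemplar"]
    parsed = {"collection": collection}
    # zip() stops at the shorter sequence, so missing parts are simply absent
    parsed.update(zip(labels, work_component.split(".")))
    return parsed


print(split_work_component("urn:cts:greekLit:tlg0011.tlg003"))
print(split_work_component("urn:cts:greekLit:tlg0525.tlg001.perseus-grc2"))
```

A URN that stops at the text group or the work simply yields fewer keys, which mirrors the way each level of the work component is optional.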

### Passage Component

Finally, CTS URNs can have a **passage component**. This is the most specific part of the CTS URN, containing references to precise passages and even words within a text.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1
```

The `1` above refers to Pausanias Book 1.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1
```

Now it references Book 1, Chapter 1.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1-2.2
```

Now we're talking about a passage spanning Book 1, Chapter 1, to Book 2, Chapter 2.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάς
```

Now we're referencing the token `Κωλιάς` in Book 1, Chapter 1, Section 5.

```
urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάδος-1.1.5@θεαί
```

As a final example, this URN references the span from Κωλιάδος to θεαί in Book 1, Chapter 1, Section 5.
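
The passage examples above can be taken apart the same way. This is a rough sketch, assuming that `-` only ever separates the two ends of a range and `@` only ever introduces a token; a token that itself contains a hyphen would need more careful handling. The names `parse_passage` and `parse_reference` are hypothetical.

```python
def parse_passage(passage: str) -> dict:
    """Break a CTS passage component into start/end references and tokens."""
    # A "-" separates the two ends of a range; a single reference has none
    start, _, end = passage.partition("-")

    def parse_reference(ref: str) -> dict:
        # "@" introduces a subreference, i.e. a token within the citation
        citation, _, token = ref.partition("@")
        return {"citation": citation.split("."), "token": token or None}

    return {
        "start": parse_reference(start),
        "end": parse_reference(end) if end else None,
    }


urn = "urn:cts:greekLit:tlg0525.tlg001.perseus-grc2:1.1.5@Κωλιάδος-1.1.5@θεαί"
# The passage component is everything after the fourth ":"
passage = urn.split(":", 4)[4]
print(parse_passage(passage))
```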

## Getting text to work with

So why this detour on CTS URNs? Because it will make it easier for you to find the texts that you need. I've already added the perseus-grc2 version of Pausanias to this repository; you can find it under `tei/tlg0525.tlg001.perseus-grc2.xml`. (The URN is abbreviated because the file comes from the [PerseusDL/canonical-greekLit](https://github.com/PerseusDL/canonical-greekLit/) repo on GitHub.)

Ideally, we would be able to request these texts from an API, but as of this writing in August 2024, all of the known APIs are not working. So for now, we will parse these files locally and transform them into data structures that facilitate our analyses.
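
Since the filenames are abbreviated URNs, we can rebuild the full URN from a filename. A small sketch, assuming the file follows the `textgroup.work.version.xml` naming shown above; `urn_from_filename` is a hypothetical helper, not something provided by the repository.

```python
from pathlib import Path


def urn_from_filename(path: str, collection: str = "greekLit") -> str:
    """Rebuild a full CTS URN from a canonical-greekLit style filename."""
    # Path.stem drops only the final ".xml", leaving the work component
    work_component = Path(path).stem
    return f"urn:cts:{collection}:{work_component}"


print(urn_from_filename("tei/tlg0525.tlg001.perseus-grc2.xml"))
```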

We'll first need to install the MyCapytain library.

In [None]:
%pip install MyCapytain

Note: you may need to restart the kernel to use updated packages.


Then we can import this module and use it to ingest the text of Pausanias, stored in the `tei/` directory of this repo.

In [1]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-eng2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-eng2", resource=f)

Let's try turning the text into a [pandas DataFrame](https://pandas.pydata.org/docs/index.html) with columns for the CTS URN, the corresponding XML, and the unannotated text of the passage.

In [2]:
# this block might take a while

from lxml import etree
from MyCapytain.common.constants import Mimetypes

urns = []
raw_xmls = []
unannotated_strings = []

for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

In [7]:
import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

## Working with textual data

Now that we have some text to work with -- and by "some text," I mean all 3170 sections of Pausanias in the above DataFrame -- we can start working with the data.

Before doing so, however, we should ask _how_ we're going to make the data more manageable -- it isn't exactly feasible to dive headfirst into a corpus of this size.

> Discuss: What units can we break Pausanias down into to make it more manageable? Don't worry about how you would do it in code yet, just think about how you might explore the units of the text.
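
One possible answer is to lean on the citation scheme that is already in each URN. The sketch below uses a toy DataFrame in place of `pausanias_df` and assumes every passage reference has the form `book.chapter.section`; the column names `book`, `chapter`, and `section` are my own.

```python
import pandas as pd

# Toy stand-in for pausanias_df: only the URN column matters here
toy_df = pd.DataFrame({
    "urn": [
        "urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1.1.1",
        "urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1.1.2",
        "urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1.2.1",
        "urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:10.1.1",
    ]
})

# The passage component is whatever follows the last ":" in the URN
passage = toy_df["urn"].str.rsplit(":", n=1).str[-1]

# Split "book.chapter.section" into three integer columns
toy_df[["book", "chapter", "section"]] = (
    passage.str.split(".", expand=True).astype(int)
)

# Number of sections in each book
sections_per_book = toy_df.groupby("book").size()
print(sections_per_book)
```

With the citation levels as columns, books and chapters become units that we can filter or group by.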


## Types of words

With all languages, but especially with heavily-inflected languages like ancient Greek and Latin, it is important to be precise about the kinds of word forms that we're dealing with.

### Tokens

A **token** or **running word** "is a single occurrence of a word form in the text" [@Brezina2018 39].

How can we count the number of tokens in all of Pausanias? First we need to **tokenize** the `unannotated_strings` column of `pausanias_df`.

Tokenization is a surprisingly complicated process depending on the language of study, and we will learn more sophisticated methods for tokenizing Greek text as we go along.

For now, however, let's define a token as "whitespace-delimited text" -- we're not going to worry about punctuation etc. just yet.

So to tokenize the `unannotated_strings` column of `pausanias_df`, we can run:

In [8]:
# See https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html for
# pandas' string-splitting utilities; str.split() splits on whitespace by default
pausanias_df['whitespaced_tokens'] = pausanias_df['unannotated_strings'].str.split()

pausanias_df

| | urn | raw_xml | unannotated_strings | whitespaced_tokens |
|---|---|---|---|---|
| 0 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | On the Greek mainland facing the Cyclades Isla... | [On, the, Greek, mainland, facing, the, Cyclad... |
| 1 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | The Peiraeus was a parish from early times, th... | [The, Peiraeus, was, a, parish, from, early, t... |
| 2 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | The most noteworthy sight in the Peiraeus is a... | [The, most, noteworthy, sight, in, the, Peirae... |
| 3 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | The Athenians have also another harbor, at Mun... | [The, Athenians, have, also, another, harbor,,... |
| 4 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | Twenty stades away is the Coliad promontory; o... | [Twenty, stades, away, is, the, Coliad, promon... |
| ... | ... | ... | ... | ... |
| 3165 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | These, then, live above Amphissa. On the coast... | [These,, then,, live, above, Amphissa., On, th... |
| 3166 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | I gather that the city got its name from a wom... | [I, gather, that, the, city, got, its, name, f... |
| 3167 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | The epic poem called the Naupactia by the Gree... | [The, epic, poem, called, the, Naupactia, by, ... |
| 3168 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | Here there is on the coast a temple of Poseido... | [Here, there, is, on, the, coast, a, temple, o... |


Okay, now that we have lists of tokens in the `whitespaced_tokens` column, how do we count them?

In [10]:
sum(len(ts) for ts in pausanias_df['whitespaced_tokens'])

272522

> Discuss: What does the above line of code do?
>
> Answer: It adds up the length of the token list in every row, which gives the total number of tokens in the DataFrame.

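The above line of code is not very idiomatic for pandas, however. Instead, we should write something like the following. Both approaches count the same thing, so they should return the same number when run against the same DataFrame; the differing outputs recorded here most likely come from cells that were executed at different times against different versions of `pausanias_df`.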

In [15]:
pausanias_df['whitespaced_tokens'].explode().count()

217416
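
To convince ourselves that the plain-Python and pandas approaches agree, here is a small check on toy data. The three strings are made up for illustration; only `str.split()`, `explode()`, and `count()` are taken from the cells above.

```python
import pandas as pd

toy = pd.Series([
    "On the Greek mainland",
    "The Peiraeus was a parish",
    "Twenty stades away",
])
tokens = toy.str.split()

# Plain-Python approach: add up the length of each row's token list
python_total = sum(len(ts) for ts in tokens)

# pandas approach: one row per token, then count the non-missing rows
pandas_total = tokens.explode().count()

print(python_total, pandas_total)
```

If the two totals ever disagree on the same DataFrame, that is a sign that some rows are missing or that the column was rebuilt between cells.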

### Types

A **type** is a unique word form in the corpus. For example, the inflected forms βουλεύεται and βουλεύομεν are each a type. (See @Brezina2018 [39-40].)

In other words, **types** are **tokens** grouped by form. So to count the number of **types** in Pausanias, we can do the following:

In [None]:
len(pausanias_df['whitespaced_tokens'].explode().unique())

41363

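> Discuss: Break the above line of code down method by method.
>
> Answer: `explode()` turns the column of token lists into one long Series with one token per row; `unique()` returns the distinct values in that Series; `len()` counts them. The result is the number of different types in the DataFrame, which is not the same as the number of tokens that appear only once.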
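
The difference between "distinct word forms" and "word forms that occur only once" is easy to blur, so here is a toy illustration. The two token lists are invented; `nunique()` and `value_counts()` are standard pandas Series methods that were not used in the cells above.

```python
import pandas as pd

tokens = pd.Series([
    ["the", "son", "of", "the", "king"],
    ["the", "king", "came"],
])

flat = tokens.explode()      # one row per token
types = flat.unique()        # array of distinct word forms
n_types = len(types)         # same number as flat.nunique()

# Types that occur exactly once are a different, smaller set
counts = flat.value_counts()
once_only = counts[counts == 1].index.tolist()

print(len(flat), n_types, sorted(once_only))
```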


What if we want to see the top `n` types in the corpus?

In [56]:
# Show the 100 most frequent types in the DataFrame
from collections import Counter  # Counter is a dict subclass from the standard library

# Tally how many times each type (distinct token string) occurs
type_counts = Counter(pausanias_df['whitespaced_tokens'].explode())

# List the 100 types with the highest counts
type_counts.most_common(100)

[('the', 2075),
 ('of', 1296),
 ('and', 662),
 ('to', 658),
 ('a', 532),
 ('in', 384),
 ('is', 377),
 ('was', 308),
 ('that', 265),
 ('they', 199),
 ('by', 197),
 ('from', 195),
 ('he', 190),
 ('his', 178),
 ('with', 149),
 ('son', 148),
 ('The', 146),
 ('at', 141),
 ('it', 138),
 ('on', 136),
 ('for', 122),
 ('are', 119),
 ('this', 118),
 ('who', 116),
 ('had', 113),
 ('as', 111),
 ('their', 111),
 ('were', 102),
 ('but', 102),
 ('an', 100),
 ('which', 97),
 ('not', 92),
 ('sanctuary', 87),
 ('have', 85),
 ('Lacedaemonians', 84),
 ('I', 83),
 ('called', 83),
 ('when', 79),
 ('also', 78),
 ('image', 68),
 ('be', 64),
 ('them', 61),
 ('him', 61),
 ('say', 61),
 ('been', 58),
 ('there', 55),
 ('one', 53),
 ('made', 51),
 ('On', 50),
 ('against', 50),
 ('after', 48),
 ('temple', 45),
 ('into', 39),
 ('other', 39),
 ('In', 39),
 ('because', 39),
 ('There', 39),
 ('place', 38),
 ('has', 38),
 ('all', 38),
 ('her', 37),
 ('stades', 37),
 ('time', 36),
 ('They', 35),
 ('king', 34),
 ('came', 

### Stop words

Hm, that's not particularly interesting -- most of these words are fairly common and will rank highly in almost any corpus. Further, since we haven't accounted for punctuation, we're probably generating frequencies incorrectly based on whether or not a type is joined to any punctuation. We need to get a bit more sophisticated.
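
To see how much punctuation and capitalization distort whitespace-based counts, here is a rough sketch that strips ASCII punctuation and lowercases each token before counting. It is only a stopgap: `string.punctuation` knows nothing about Greek punctuation such as the raised dot, and the sample sentence is invented.

```python
import string
from collections import Counter

text = "The Athenians have another harbor, and the harbor at Munychia."

raw_tokens = text.split()

# Strip leading/trailing ASCII punctuation and lowercase each token;
# tokens that end up empty (pure punctuation) are dropped
clean_tokens = [
    cleaned
    for tok in raw_tokens
    if (cleaned := tok.strip(string.punctuation).lower())
]

raw_counts = Counter(raw_tokens)
clean_counts = Counter(clean_tokens)

print(raw_counts["harbor"], clean_counts["harbor"])
print(raw_counts["the"], clean_counts["the"])
```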

Let's install `spacy` and `grecy` to perform better tokenization and incorporate the notion of a **stop word**: a token that is so common that including it in most statistical analyses will just generate noise.

In [1]:
## Uncomment the line for your system's architecture
%pip install spacy
# %pip install 'spacy[apple]'
%pip install grecy


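Now we need to install a model. The cell below downloads spaCy's small English model, `en_core_web_sm`, since we loaded the English translation above; for the Greek text you would install one of the `grecy` models instead (for example `grc_proiel_trf`). Note that there is a known but so-far unpatched issue where installing the `grecy` models will only work with Python 3.11.9 and pip 24.0 (or a bit older in either case).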

In [19]:
%run -m spacy download en_core_web_sm



In [38]:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [39]:
tokenizer = nlp.tokenizer

pausanias_df['tokens'] = pausanias_df['unannotated_strings'].apply(tokenizer)

Tokenizing with spaCy attaches some features to each of the tokens in the `tokens` column. Now we can collect the types and exclude stop words using the `token.is_stop` attribute.

In [79]:
types = [t.text for t in pausanias_df['tokens'].explode() if not t.is_stop and t.is_alpha]
# is_alpha is True only for tokens made up entirely of alphabetic characters,
# so it filters out punctuation and numbers

type_counts = Counter(types)

type_counts.most_common(100)

[('son', 155),
 ('Lacedaemonians', 108),
 ('sanctuary', 94),
 ('called', 91),
 ('image', 71),
 ('place', 56),
 ('city', 46),
 ('temple', 46),
 ('king', 45),
 ('Heracles', 45),
 ('time', 45),
 ('Sparta', 44),
 ('stades', 42),
 ('Apollo', 37),
 ('said', 37),
 ('sea', 35),
 ('came', 35),
 ('Athena', 35),
 ('war', 34),
 ('sons', 33),
 ('Artemis', 33),
 ('throne', 32),
 ('Athenians', 29),
 ('Agesilaus', 28),
 ('land', 27),
 ('left', 27),
 ('people', 27),
 ('road', 27),
 ('Zeus', 26),
 ('took', 25),
 ('man', 25),
 ('Asclepius', 25),
 ('Cleomenes', 25),
 ('set', 24),
 ('daughter', 24),
 ('brought', 24),
 ('far', 24),
 ('Pausanias', 24),
 ('Agis', 23),
 ('statue', 23),
 ('hero', 22),
 ('Tyndareus', 21),
 ('oracle', 21),
 ('army', 21),
 ('old', 21),
 ('away', 20),
 ('tomb', 20),
 ('house', 20),
 ('battle', 20),
 ('Achilles', 20),
 ('Dionysus', 20),
 ('named', 19),
 ('death', 19),
 ('water', 19),
 ('Lacedaemon', 19),
 ('account', 19),
 ('Argives', 19),
 ('won', 19),
 ('god', 19),
 ('bronze', 19)

Much better! We now have a list of the most common types, excluding stop words and punctuation.

Be careful, though: these are still just raw counts, and they tell us very little about how we might characterize Pausanias vis-à-vis a larger corpus.
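
One common way to make counts comparable across texts of different lengths is to convert them to relative frequencies (occurrences per some fixed number of tokens). A minimal sketch with toy token lists; the function name `relative_frequencies` and the `basis` parameter are my own, and the basis of 100 is chosen only to keep the toy numbers readable.

```python
from collections import Counter


def relative_frequencies(tokens, basis=10_000):
    """Return each type's frequency per `basis` tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total * basis for word, count in counts.items()}


# Toy token lists standing in for two texts of different lengths
text_a = ["sanctuary", "image", "sanctuary", "king"]
text_b = ["sanctuary", "war", "king", "war", "war", "army", "king", "war"]

freq_a = relative_frequencies(text_a, basis=100)
freq_b = relative_frequencies(text_b, basis=100)

print(freq_a["sanctuary"], freq_b["sanctuary"])
```

In text_b "king" has the same raw count as "sanctuary" does in text_a, but a much lower relative frequency because the text is twice as long.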

### Lemmata/lemmas

A **lemma** (plural **lemmata** or **lemmas**) represents "a group of all inflectional forms related to one stem that belong to the same word class (Kučera & Francis 1967: 1)" [@Brezina2018 40]. In simpler terms, a **lemma** is the dictionary form of a word, so grouping tokens by **lemma** gives us a way of further reducing the number of distinct items that we count. ἐστίν, ἐσμέν, and εἰσίν all have the same **lemma**: εἰμί.
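
To see what grouping by lemma does to our counts, here is a toy sketch that uses a hand-written lookup table in place of a real lemmatizer. The table `FORM_TO_LEMMA` and the token list are invented for illustration; a real lemmatizer derives this mapping from a trained model or a lexicon.

```python
from collections import Counter

# Hand-written form -> lemma lookup, for illustration only
FORM_TO_LEMMA = {
    "ἐστίν": "εἰμί",
    "ἐσμέν": "εἰμί",
    "εἰσίν": "εἰμί",
    "λέγουσιν": "λέγω",
    "λέγει": "λέγω",
}

tokens = ["ἐστίν", "λέγουσιν", "εἰσίν", "ἐσμέν", "λέγει", "ἐστίν"]

type_counts = Counter(tokens)
# Fall back to the form itself when the lookup has no entry
lemma_counts = Counter(FORM_TO_LEMMA.get(tok, tok) for tok in tokens)

print(len(type_counts), "types ->", len(lemma_counts), "lemmata")
print(lemma_counts.most_common())
```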

Lemmatization, as you might guess, often involves additional processing. Luckily, we can use the spaCy and greCy models again.

In [41]:
raw_texts = pausanias_df['unannotated_strings'].tolist()
annotated_texts = nlp.pipe(raw_texts, batch_size=100)

pausanias_df['nlp_docs'] = list(annotated_texts)

In [96]:
lemmata = [t.lemma_ for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lemmata_counts = Counter(lemmata)

lemmata_counts.most_common(100)

[('Messenians', 192),
 ('Lacedaemonians', 143),
 ('son', 142),
 ('man', 119),
 ('come', 83),
 ('Aristomenes', 75),
 ('say', 67),
 ('war', 67),
 ('call', 60),
 ('time', 56),
 ('Messene', 53),
 ('city', 53),
 ('king', 52),
 ('battle', 47),
 ('god', 46),
 ('daughter', 45),
 ('give', 45),
 ('take', 44),
 ('people', 44),
 ('Ithome', 43),
 ('Messenia', 41),
 ('messenian', 40),
 ('bring', 40),
 ('year', 40),
 ('town', 37),
 ('Aristodemus', 37),
 ('country', 36),
 ('receive', 36),
 ('death', 36),
 ('know', 36),
 ('capture', 35),
 ('Arcadians', 34),
 ('land', 33),
 ('great', 32),
 ('account', 31),
 ('see', 31),
 ('day', 30),
 ('troop', 30),
 ('fight', 29),
 ('statue', 29),
 ('place', 29),
 ('send', 29),
 ('kill', 28),
 ('Sparta', 28),
 ('force', 27),
 ('return', 27),
 ('house', 25),
 ('carry', 24),
 ('drive', 24),
 ('temple', 24),
 ('wall', 24),
 ('go', 23),
 ('having', 23),
 ('water', 23),
 ('Zeus', 23),
 ('attack', 23),
 ('join', 23),
 ('Eira', 23),
 ('spring', 22),
 ('child', 22),
 ('Heracle

### Lexemes

Finally, "a **lexeme** is a lemma with a particular meaning attached to it.... The best way of conceptualizing a lexeme is as a subentry in a dictionary" [@Brezina2018 40].

One challenge of working with lexemes is that, even with the advances of large language models like ChatGPT, there is no surefire way to annotate them automatically. We still need "human-in-the-loop" pipelines to catch errors and ambiguities. And keep in mind that even two humans might disagree on the lexeme for a particular word!

But we can inspect the `lex` attributes of the tokens that spaCy has generated for us and see if they make sense.

In [55]:
lexemes = [(t.text, t.lex) for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lexeme_counts = Counter(lexemes)

lexeme_counts.most_common(100)

[(('ἐνταῦθα', <spacy.lexeme.Lexeme at 0x39f333c80>), 491),
 (('μάλιστα', <spacy.lexeme.Lexeme at 0x39f333980>), 421),
 (('ἄγαλμα', <spacy.lexeme.Lexeme at 0x392a7dbc0>), 411),
 (('ἱερὸν', <spacy.lexeme.Lexeme at 0x392abfe40>), 369),
 (('ὕστερον', <spacy.lexeme.Lexeme at 0x392a7f840>), 351),
 (('σφισιν', <spacy.lexeme.Lexeme at 0x39f332b00>), 347),
 (('ὄνομα', <spacy.lexeme.Lexeme at 0x392a89180>), 345),
 (('γενέσθαι', <spacy.lexeme.Lexeme at 0x392a88300>), 329),
 (('λέγουσιν', <spacy.lexeme.Lexeme at 0x392a7e240>), 328),
 (('τότε', <spacy.lexeme.Lexeme at 0x392a80980>), 287),
 (('ἐπʼ', <spacy.lexeme.Lexeme at 0x392a851c0>), 268),
 (('πόλιν', <spacy.lexeme.Lexeme at 0x392a7e3c0>), 258),
 (('ἤδη', <spacy.lexeme.Lexeme at 0x392aa1b00>), 257),
 (('σφᾶς', <spacy.lexeme.Lexeme at 0x392a92680>), 250),
 (('φασιν', <spacy.lexeme.Lexeme at 0x392a7d500>), 238),
 (('δʼ', <spacy.lexeme.Lexeme at 0x392ac2a80>), 234),
 (('πρότερον', <spacy.lexeme.Lexeme at 0x39f331d80>), 227),
 (('σφίσιν', <spacy.lex

Wait a second -- this list looks identical to our list of word types.

Sure enough, when we check the SpaCy documentation for [Lexeme](https://spacy.io/api/lexeme):

> A Lexeme has no string context – it’s a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).


## Review

> Discuss: Define **token**, **type**, **stop word**, **lemma**, and **lexeme** in your own words.

> Answer: A **token** is a single occurrence of a word in the text. A **type** is a unique word form: 'the' appearing twice in a sentence is two tokens but one type. **Stop words** are very common words like 'and' or 'the' that add little substance to an analysis. A **lemma** groups all the inflected forms of one stem, e.g. 'time' and 'times'. A **lexeme** is a lemma with a particular meaning attached, e.g. the different senses of 'time'.

> Discuss: How can we use these different notions of "word" in our analysis of corpora? Why is it important to be precise about what kind of word(s) we're using?

> Answer: Tokens give us a baseline for the number of words in a text. Types show us which distinct word forms appear in the text. Lemmata reduce the count further by grouping the inflected forms of the same word. Lexemes let us look closely at distinctions of meaning between uses of a word. It's important to be precise, especially in large corpora, because it allows the researcher to search the text for interesting trends more efficiently and accurately.

## Homework

1. Read @Brezina2018 [ch. 2, pp. 41--65].
2. Choose 3 books of Pausanias and calculate the most common tokens, types, and lemmata for each. In a paragraph or so, describe your findings relative to the work we have done in class today.
3. Using your findings from 2., write a short (1-page) evaluation of one of the books of Pausanias that you have analyzed. Does your qualitative -- which is not to say "subjective" -- experience of reading the text cohere with your quantitative evaluation?
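
As a starting point for item 2, here is one way you might count types separately for several books. It is a sketch on toy data, not a required approach: the `rows` list stands in for the `urn` and `unannotated_strings` columns, and the helper names are hypothetical. Note that it compares the parsed book number rather than testing a URN prefix, because a prefix ending in `:1` would also match Book 10.

```python
from collections import Counter

# Toy (urn, text) pairs standing in for rows of pausanias_df
rows = [
    ("urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:1.1.1", "the harbor of the Athenians"),
    ("urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4.1.1", "the Messenians and the war"),
    ("urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:10.1.1", "the oracle at Delphi"),
]


def book_number(urn: str) -> int:
    # The passage component follows the last ":"; the book is its first field
    return int(urn.rsplit(":", 1)[-1].split(".")[0])


def top_types_by_book(rows, books, n=2):
    """Count whitespace-delimited types separately for each requested book."""
    counters = {book: Counter() for book in books}
    for urn, text in rows:
        book = book_number(urn)
        if book in counters:
            counters[book].update(text.split())
    return {book: c.most_common(n) for book, c in counters.items()}


print(top_types_by_book(rows, books=[1, 4]))
```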

In [66]:
from lxml import etree
from MyCapytain.common.constants import Mimetypes
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText
import pandas as pd

with open("../tei/tlg0525.tlg001.perseus-eng2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-eng2", resource=f)

urns = []
raw_xmls = []
unannotated_strings = []
 
for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")
 
    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)
 

In [67]:
import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

In [82]:
def get_book_of_pausanias(df: pd.DataFrame, book_n: int) -> pd.DataFrame:
    # End the prefix with "." so that Book 1 does not also match Book 10;
    # .copy() lets us add columns to the result without a SettingWithCopyWarning
    prefix = f"urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:{book_n}."
    return df[df['urn'].str.startswith(prefix)].copy()


# Reassign pausanias_df so that every later cell works on Book 4 only
pausanias_df = get_book_of_pausanias(pausanias_df, 4)

pausanias_df['whitespaced_tokens'] = pausanias_df['unannotated_strings'].str.split()

pausanias_df



| | urn | raw_xml | unannotated_strings | whitespaced_tokens |
|---|---|---|---|---|
| 895 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | The frontier between Messenia and that part of... | [The, frontier, between, Messenia, and, that, ... |
| 896 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | Messene, being proud of her origin, for her fa... | [Messene,, being, proud, of, her, origin,, for... |
| 897 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | Before the battle which the Thebans fought wit... | [Before, the, battle, which, the, Thebans, fou... |
| 898 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | He is still more clear when speaking about the... | [He, is, still, more, clear, when, speaking, a... |
| 899 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | The first rulers then in this country were Pol... | [The, first, rulers, then, in, this, country, ... |
| ... | ... | ... | ... | ... |
| 1213 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | These cattle must have been of Thessalian stoc... | [These, cattle, must, have, been, of, Thessali... |
| 1214 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | Eryx too, who was reigning then in Sicily, pla... | [Eryx, too,, who, was, reigning, then, in, Sic... |
| 1215 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | But the cattle of Neleus were pastured for the... | [But, the, cattle, of, Neleus, were, pastured,... |
| 1216 | urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:4... | `<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns...` | The island of Sphacteria lies in front of the ... | [The, island, of, Sphacteria, lies, in, front,... |


In [83]:
sum(len(ts) for ts in pausanias_df['whitespaced_tokens'])

26395

In [84]:
#doing it in Pandas
pausanias_df['whitespaced_tokens'].explode().count()

26395

In [87]:
#types in book 4
len(pausanias_df['whitespaced_tokens'].explode().unique())

5310

In [88]:
# Show the 100 most frequent types in Book 4
# (pausanias_df was already narrowed to Book 4 in an earlier cell)
from collections import Counter  # Counter is a dict subclass from the standard library

# Tally how many times each type (distinct token string) occurs
type_counts = Counter(pausanias_df['whitespaced_tokens'].explode())

# List the 100 types with the highest counts
type_counts.most_common(100)

[('the', 2435),
 ('of', 1193),
 ('and', 927),
 ('to', 866),
 ('in', 477),
 ('a', 372),
 ('was', 344),
 ('that', 338),
 ('they', 331),
 ('their', 292),
 ('were', 266),
 ('by', 260),
 ('he', 216),
 ('his', 212),
 ('from', 206),
 ('with', 198),
 ('for', 193),
 ('is', 185),
 ('as', 184),
 ('at', 181),
 ('The', 174),
 ('had', 171),
 ('on', 155),
 ('but', 155),
 ('them', 146),
 ('who', 142),
 ('Messenians', 142),
 ('it', 138),
 ('not', 128),
 ('son', 116),
 ('all', 106),
 ('when', 105),
 ('this', 98),
 ('which', 92),
 ('Lacedaemonians', 90),
 ('be', 81),
 ('have', 71),
 ('him', 68),
 ('no', 67),
 ('men', 65),
 ('been', 63),
 ('But', 63),
 ('I', 60),
 ('They', 59),
 ('an', 59),
 ('Aristomenes', 58),
 ('after', 57),
 ('first', 54),
 ('When', 54),
 ('made', 53),
 ('also', 50),
 ('He', 49),
 ('her', 48),
 ('being', 47),
 ('called', 46),
 ('said', 45),
 ('would', 45),
 ('came', 44),
 ('Messenian', 44),
 ('time', 43),
 ('For', 43),
 ('into', 42),
 ('are', 42),
 ('other', 41),
 ('war', 41),
 ('city

In [89]:
## Uncomment the line for your system's architecture
%pip install spacy
# %pip install 'spacy[apple]'
%pip install grecy

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [90]:
%run -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 29.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [91]:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [92]:
tokenizer = nlp.tokenizer

pausanias_df['tokens'] = pausanias_df['unannotated_strings'].apply(tokenizer)

In [98]:
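# Top 100 types in book 4, skipping stop words and non-alphabetic tokens.
# is_stop flags stop words; is_alpha is True when a token consists only of alphabetic characters.
types = [t.text for t in pausanias_df['tokens'].explode() if not t.is_stop and t.is_alpha]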

type_counts = Counter(types)

type_counts.most_common(100)

[('Messenians', 192),
 ('Lacedaemonians', 143),
 ('son', 120),
 ('Aristomenes', 82),
 ('men', 81),
 ('war', 66),
 ('Messene', 56),
 ('called', 55),
 ('time', 53),
 ('said', 53),
 ('Messenian', 50),
 ('battle', 47),
 ('came', 46),
 ('city', 44),
 ('Ithome', 43),
 ('daughter', 42),
 ('Messenia', 41),
 ('people', 40),
 ('Aristodemus', 37),
 ('man', 37),
 ('country', 36),
 ('death', 36),
 ('Arcadians', 34),
 ('king', 33),
 ('land', 30),
 ('took', 29),
 ('account', 29),
 ('having', 29),
 ('Sparta', 28),
 ('received', 27),
 ('brought', 26),
 ('god', 26),
 ('town', 25),
 ('come', 24),
 ('troops', 24),
 ('Heracles', 23),
 ('gave', 23),
 ('place', 23),
 ('Zeus', 23),
 ('year', 23),
 ('Eira', 23),
 ('sons', 22),
 ('killed', 22),
 ('night', 22),
 ('following', 21),
 ('day', 21),
 ('Homer', 21),
 ('house', 21),
 ('Paus', 21),
 ('cattle', 21),
 ('way', 21),
 ('Greeks', 20),
 ('statue', 20),
 ('saw', 20),
 ('water', 20),
 ('sent', 20),
 ('Lacedaemonian', 20),
 ('Euphaes', 20),
 ('gods', 20),
 ('Pelo
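
The filter in the cell above relies on two spaCy token attributes, `is_stop` and `is_alpha`. As a rough pure-Python analogue, the sketch below applies the same two conditions to a made-up token list. `STOP_WORDS` is a tiny hand-made set for illustration (spaCy's own list is much longer), and `keep` is a hypothetical helper:

```python
from collections import Counter

# Tiny hand-made stop list for illustration; not spaCy's list.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "was"}


def keep(token: str) -> bool:
    """Keep tokens that are purely alphabetic and not stop words (case-insensitive)."""
    return token.isalpha() and token.lower() not in STOP_WORDS


# Made-up tokens, including punctuation and a number that should be dropped.
tokens = ["The", "Messenians", "and", "the", "Lacedaemonians",
          "went", "to", "war", ",", "war", "in", "480"]

filtered = [t for t in tokens if keep(t)]
counts = Counter(filtered)

print(filtered)
print(counts.most_common(3))
```

The stop-word check is made case-insensitive here because the filtered output above no longer contains capitalized forms such as `'The'` or `'But'`, which suggests spaCy treats them as stop words too.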

In [99]:
raw_texts = [t for t in pausanias_df['unannotated_strings']]
annotated_texts = nlp.pipe(raw_texts, batch_size=100)

pausanias_df['nlp_docs'] = list(annotated_texts)

In [100]:
lemmata = [t.lemma_ for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lemmata_counts = Counter(lemmata)

lemmata_counts.most_common(100)

[('Messenians', 192),
 ('Lacedaemonians', 143),
 ('son', 142),
 ('man', 119),
 ('come', 83),
 ('Aristomenes', 75),
 ('say', 67),
 ('war', 67),
 ('call', 60),
 ('time', 56),
 ('Messene', 53),
 ('city', 53),
 ('king', 52),
 ('battle', 47),
 ('god', 46),
 ('daughter', 45),
 ('give', 45),
 ('take', 44),
 ('people', 44),
 ('Ithome', 43),
 ('Messenia', 41),
 ('messenian', 40),
 ('bring', 40),
 ('year', 40),
 ('town', 37),
 ('Aristodemus', 37),
 ('country', 36),
 ('receive', 36),
 ('death', 36),
 ('know', 36),
 ('capture', 35),
 ('Arcadians', 34),
 ('land', 33),
 ('great', 32),
 ('account', 31),
 ('see', 31),
 ('day', 30),
 ('troop', 30),
 ('fight', 29),
 ('statue', 29),
 ('place', 29),
 ('send', 29),
 ('kill', 28),
 ('Sparta', 28),
 ('force', 27),
 ('return', 27),
 ('house', 25),
 ('carry', 24),
 ('drive', 24),
 ('temple', 24),
 ('wall', 24),
 ('go', 23),
 ('having', 23),
 ('water', 23),
 ('Zeus', 23),
 ('attack', 23),
 ('join', 23),
 ('Eira', 23),
 ('spring', 22),
 ('child', 22),
 ('Heracle

In [101]:
lexemes = [(t.text, t.lex) for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lexeme_counts = Counter(lexemes)

lexeme_counts.most_common(100)

[(('Messenians', <spacy.lexeme.Lexeme at 0x1301806c0>), 192),
 (('Lacedaemonians', <spacy.lexeme.Lexeme at 0x130182740>), 143),
 (('son', <spacy.lexeme.Lexeme at 0x13017ed00>), 120),
 (('Aristomenes', <spacy.lexeme.Lexeme at 0x1300b8880>), 82),
 (('men', <spacy.lexeme.Lexeme at 0x1301801c0>), 81),
 (('war', <spacy.lexeme.Lexeme at 0x130198500>), 66),
 (('Messene', <spacy.lexeme.Lexeme at 0x13017e900>), 56),
 (('called', <spacy.lexeme.Lexeme at 0x13017c480>), 55),
 (('time', <spacy.lexeme.Lexeme at 0x13017c540>), 53),
 (('said', <spacy.lexeme.Lexeme at 0x130186f00>), 53),
 (('Messenian', <spacy.lexeme.Lexeme at 0x130180040>), 50),
 (('battle', <spacy.lexeme.Lexeme at 0x130182e00>), 47),
 (('came', <spacy.lexeme.Lexeme at 0x130183900>), 46),
 (('city', <spacy.lexeme.Lexeme at 0x130182200>), 44),
 (('Ithome', <spacy.lexeme.Lexeme at 0x130181e40>), 43),
 (('daughter', <spacy.lexeme.Lexeme at 0x13017ea80>), 42),
 (('Messenia', <spacy.lexeme.Lexeme at 0x13017d400>), 41),
 (('people', <spacy.

> Book 4: In book 4, there are 26395 tokens and 5310 types. The most common types include words like "the", "of", "and", and "to", which is not surprising. After excluding stop words, however, the most common types are "Messenians", "Lacedaemonians", "son", "Aristomenes", and "men". The most common lemmata are very similar: "Messenians", "Lacedaemonians", "son", "man", and "come". "Come" is the only new entry, as "man" and "men" share a lemma. The five most common lexemes in book 4 are the same as the five most common types.
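
To make the "man"/"men" point concrete, here is a minimal sketch of how lemmatization merges the counts of inflected forms. The two type counts are copied from the book 4 output above; `LEMMA_OF` is a hand-written stand-in for spaCy's lemmatizer and is an assumption of this sketch:

```python
from collections import Counter

# Type counts for two forms, copied from the filtered book 4 output above.
type_counts = Counter({"men": 81, "man": 37})

# Hand-written lemma map standing in for spaCy's lemmatizer (an assumption).
LEMMA_OF = {"men": "man", "man": "man"}

# Add each form's count to the count of its lemma.
lemma_counts = Counter()
for form, n in type_counts.items():
    lemma_counts[LEMMA_OF[form]] += n

print(lemma_counts["man"])  # 81 + 37 = 118
```

The lemma list above reports 119 for "man", one more than this sum, so at least one further form not covered by this sketch presumably maps to the same lemma.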

In [1]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText
import pandas as pd
 
with open("../tei/tlg0525.tlg001.perseus-eng2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-eng2", resource=f)
 
from lxml import etree
from MyCapytain.common.constants import Mimetypes
 
urns = []
raw_xmls = []
unannotated_strings = []
 
for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")
 
    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)
 

In [2]:
import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

In [3]:
# Rebuild the full frame from d (defined in the previous cell) so this cell can be re-run
pausanias_df = pd.DataFrame(d)

def get_book_of_pausanias(df: pd.DataFrame, book_n: int) -> pd.DataFrame:
    # The trailing "." keeps book 1 from also matching book 10
    prefix = f"urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:{book_n}."
    # .copy() returns an independent frame, so adding columns later
    # does not trigger pandas' SettingWithCopyWarning
    return df[df['urn'].str.startswith(prefix)].copy()


# Reassign pausanias_df so that it holds only book 5
pausanias_df = get_book_of_pausanias(pausanias_df, 5)

pausanias_df['whitespaced_tokens'] = pausanias_df['unannotated_strings'].str.split()

pausanias_df



Unnamed: 0,urn,raw_xml,unannotated_strings,whitespaced_tokens
1218,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",The Greeks who say that the Peloponnesus has f...,"[The, Greeks, who, say, that, the, Peloponnesu..."
1219,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",The rest of Peloponnesus belongs to immigrants...,"[The, rest, of, Peloponnesus, belongs, to, imm..."
1220,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",The Eleans we know crossed over from Calydon a...,"[The, Eleans, we, know, crossed, over, from, C..."
1221,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","The Moon, they say, fell in love with this End...","[The, Moon,, they, say,, fell, in, love, with,..."
1222,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",Of his brothers they say that Aetolus remained...,"[Of, his, brothers, they, say, that, Aetolus, ..."
...,...,...,...,...
1475,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","The Hermes carrying the ram under his arm, wit...","[The, Hermes, carrying, the, ram, under, his, ..."
1476,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",Of the bronze oxen one was dedicated by the Co...,"[Of, the, bronze, oxen, one, was, dedicated, b..."
1477,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",Sitting under this ox a little boy was playing...,"[Sitting, under, this, ox, a, little, boy, was..."
1478,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:5...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","Under the plane trees in the Altis, just about...","[Under, the, plane, trees, in, the, Altis,, ju..."
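
The book filter selects rows whose URN starts with a given prefix. When filtering this way, it helps to include the `.` separator after the book number, because a bare prefix ending in `1` would match book 10 as well as book 1. A minimal sketch with made-up passage references, assuming they follow the `book.chapter.section` shape; `book_prefix` is a hypothetical helper:

```python
# Base URN from this notebook; the passage references below are made up.
BASE = "urn:cts:greekLit:tlg0525.tlg001.perseus-eng2"
urns = [f"{BASE}:{ref}" for ref in ["1.1.1", "1.2.1", "5.1.1", "10.1.1", "10.2.1"]]


def book_prefix(book_n: int, with_separator: bool) -> str:
    """Hypothetical helper that builds the prefix used for filtering."""
    suffix = "." if with_separator else ""
    return f"{BASE}:{book_n}{suffix}"


loose = [u for u in urns if u.startswith(book_prefix(1, with_separator=False))]
strict = [u for u in urns if u.startswith(book_prefix(1, with_separator=True))]

print(len(loose))   # books 1 and 10 both match
print(len(strict))  # only book 1 matches
```

For books 4, 5, and 6 the two prefixes select the same rows, so the counts reported in this notebook are unaffected either way.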


In [4]:
# Token count with pandas: explode() yields one row per token, count() tallies them
pausanias_df['whitespaced_tokens'].explode().count()

22840

In [5]:
#types in book 5
len(pausanias_df['whitespaced_tokens'].explode().unique())

5084

In [6]:
# Counter is a class in the standard library's collections module
from collections import Counter

# Make sure the frame holds only book 5 (no effect if it was already filtered)
pausanias_df = get_book_of_pausanias(pausanias_df, 5)

# Count how many times each type occurs among the whitespace tokens
type_counts = Counter(pausanias_df['whitespaced_tokens'].explode())

# Show the 100 most frequent types with their counts
type_counts.most_common(100)

[('the', 2224),
 ('of', 1245),
 ('and', 694),
 ('to', 531),
 ('a', 454),
 ('is', 451),
 ('in', 370),
 ('that', 247),
 ('by', 243),
 ('was', 236),
 ('from', 204),
 ('on', 201),
 ('are', 194),
 ('The', 190),
 ('they', 181),
 ('with', 174),
 ('at', 167),
 ('for', 159),
 ('it', 141),
 ('his', 136),
 ('as', 120),
 ('he', 120),
 ('an', 120),
 ('who', 106),
 ('not', 105),
 ('were', 104),
 ('this', 102),
 ('I', 98),
 ('but', 96),
 ('have', 96),
 ('which', 88),
 ('one', 86),
 ('their', 85),
 ('son', 78),
 ('also', 74),
 ('made', 69),
 ('called', 68),
 ('be', 68),
 ('Eleans', 65),
 ('Zeus', 65),
 ('been', 65),
 ('altar', 63),
 ('dedicated', 61),
 ('has', 60),
 ('other', 60),
 ('them', 60),
 ('had', 59),
 ('after', 53),
 ('there', 48),
 ('him', 47),
 ('image', 47),
 ('say', 46),
 ('It', 45),
 ('her', 45),
 ('There', 44),
 ('first', 43),
 ('when', 41),
 ('two', 41),
 ('Heracles', 40),
 ('too', 40),
 ('all', 38),
 ('On', 37),
 ('about', 37),
 ('This', 35),
 ('upon', 35),
 ('Olympia', 34),
 ('no', 3

In [7]:
## Uncomment the line for your system's architecture
%pip install spacy
# %pip install 'spacy[apple]'
%pip install grecy

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [8]:
%run -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 29.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [10]:
tokenizer = nlp.tokenizer

pausanias_df['tokens'] = pausanias_df['unannotated_strings'].apply(tokenizer)

In [11]:
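# Top 100 types in book 5, skipping stop words and non-alphabetic tokens.
# is_stop flags stop words; is_alpha is True when a token consists only of alphabetic characters.
types = [t.text for t in pausanias_df['tokens'].explode() if not t.is_stop and t.is_alpha]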

type_counts = Counter(types)

type_counts.most_common(100)

[('Zeus', 105),
 ('Eleans', 90),
 ('son', 82),
 ('called', 72),
 ('altar', 70),
 ('dedicated', 65),
 ('Heracles', 64),
 ('Olympia', 61),
 ('image', 55),
 ('Elis', 54),
 ('games', 44),
 ('said', 36),
 ('inscription', 36),
 ('land', 35),
 ('race', 35),
 ('temple', 35),
 ('offerings', 33),
 ('Pelops', 31),
 ('right', 31),
 ('Olympic', 31),
 ('Festival', 31),
 ('set', 30),
 ('time', 29),
 ('won', 29),
 ('city', 29),
 ('river', 28),
 ('images', 28),
 ('god', 27),
 ('came', 25),
 ('Altis', 25),
 ('bronze', 25),
 ('Greeks', 24),
 ('left', 24),
 ('man', 23),
 ('war', 23),
 ('Alpheius', 23),
 ('stands', 23),
 ('account', 22),
 ('men', 22),
 ('feet', 22),
 ('chariot', 21),
 ('offering', 21),
 ('people', 20),
 ('sacrifice', 20),
 ('day', 20),
 ('sanctuary', 20),
 ('ancient', 20),
 ('horses', 20),
 ('sons', 19),
 ('throne', 19),
 ('sea', 19),
 ('Artemis', 19),
 ('olive', 19),
 ('number', 18),
 ('Lacedaemonians', 18),
 ('hand', 17),
 ('present', 17),
 ('near', 17),
 ('held', 17),
 ('boys', 17),
 ('

In [12]:
raw_texts = [t for t in pausanias_df['unannotated_strings']]
annotated_texts = nlp.pipe(raw_texts, batch_size=100)

pausanias_df['nlp_docs'] = list(annotated_texts)

In [13]:
lemmata = [t.lemma_ for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lemmata_counts = Counter(lemmata)

lemmata_counts.most_common(100)

[('Zeus', 105),
 ('son', 101),
 ('Eleans', 90),
 ('altar', 85),
 ('image', 83),
 ('call', 77),
 ('dedicate', 66),
 ('Olympia', 61),
 ('Heracles', 55),
 ('Elis', 54),
 ('offering', 54),
 ('come', 52),
 ('say', 51),
 ('hold', 46),
 ('man', 46),
 ('inscription', 46),
 ('race', 45),
 ('game', 45),
 ('stand', 44),
 ('god', 43),
 ('city', 42),
 ('temple', 36),
 ('foot', 36),
 ('land', 35),
 ('sacrifice', 35),
 ('horse', 35),
 ('time', 34),
 ('win', 34),
 ('right', 34),
 ('give', 33),
 ('set', 32),
 ('river', 32),
 ('woman', 32),
 ('Pelops', 31),
 ('Festival', 31),
 ('great', 28),
 ('chariot', 28),
 ('account', 26),
 ('statue', 26),
 ('figure', 26),
 ('daughter', 25),
 ('war', 25),
 ('Altis', 25),
 ('bronze', 25),
 ('Greeks', 24),
 ('victory', 24),
 ('take', 23),
 ('day', 23),
 ('Alpheius', 23),
 ('run', 22),
 ('near', 22),
 ('boy', 22),
 ('sanctuary', 22),
 ('carry', 22),
 ('work', 22),
 ('hand', 21),
 ('follow', 21),
 ('people', 21),
 ('grow', 21),
 ('throne', 20),
 ('enter', 20),
 ('ancien

In [14]:
lexemes = [(t.text, t.lex) for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lexeme_counts = Counter(lexemes)

lexeme_counts.most_common(100)

[(('Zeus', <spacy.lexeme.Lexeme at 0x11dfaaa80>), 105),
 (('Eleans', <spacy.lexeme.Lexeme at 0x11dfa87c0>), 90),
 (('son', <spacy.lexeme.Lexeme at 0x11dfaaa00>), 82),
 (('called', <spacy.lexeme.Lexeme at 0x11dfa9100>), 72),
 (('altar', <spacy.lexeme.Lexeme at 0x11e84c080>), 70),
 (('dedicated', <spacy.lexeme.Lexeme at 0x11e828640>), 65),
 (('Heracles', <spacy.lexeme.Lexeme at 0x11e820940>), 64),
 (('Olympia', <spacy.lexeme.Lexeme at 0x11e618340>), 61),
 (('image', <spacy.lexeme.Lexeme at 0x11e828880>), 55),
 (('Elis', <spacy.lexeme.Lexeme at 0x11e820d00>), 54),
 (('games', <spacy.lexeme.Lexeme at 0x11e81e2c0>), 44),
 (('said', <spacy.lexeme.Lexeme at 0x11e81c880>), 36),
 (('inscription', <spacy.lexeme.Lexeme at 0x11e82c580>), 36),
 (('land', <spacy.lexeme.Lexeme at 0x11dfa8d00>), 35),
 (('race', <spacy.lexeme.Lexeme at 0x11e618280>), 35),
 (('temple', <spacy.lexeme.Lexeme at 0x11e81cac0>), 35),
 (('offerings', <spacy.lexeme.Lexeme at 0x11e8c4800>), 33),
 (('Pelops', <spacy.lexeme.Lexem

> Book 5: In book 5, there are 22840 tokens and 5084 types. The most common types are "the", "of", "and", "to", and "a", which makes sense as they are stop words. After filtering out stop words, the most common types are "Zeus", "Eleans", "son", "called", and "altar". The top five lemmata are "Zeus", "son", "Eleans", "altar", and "image". The top five lexemes are the same as the top five types.
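
One simple way to compare two top-five lists like these is with set operations. The sketch below uses the book 5 lists copied from the outputs above and shows which entries are shared and which appear in only one list:

```python
# Top-five lists copied from the book 5 outputs above.
top_types = ["Zeus", "Eleans", "son", "called", "altar"]
top_lemmata = ["Zeus", "son", "Eleans", "altar", "image"]

shared = set(top_types) & set(top_lemmata)        # in both lists
only_types = set(top_types) - set(top_lemmata)    # dropped after lemmatization
only_lemmata = set(top_lemmata) - set(top_types)  # new after lemmatization

print(sorted(shared))
print(sorted(only_types))
print(sorted(only_lemmata))
```

Sets ignore rank, so this shows membership changes only; the order differences between the two lists still have to be read off the outputs.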

In [15]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText
import pandas as pd
 
with open("../tei/tlg0525.tlg001.perseus-eng2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-eng2", resource=f)
 
from lxml import etree
from MyCapytain.common.constants import Mimetypes
 
urns = []
raw_xmls = []
unannotated_strings = []
 
for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")
 
    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)
 

In [16]:
import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

In [18]:
# Rebuild the full frame from d, the dictionary of columns defined in the previous cell,
# so this cell can be re-run
pausanias_df = pd.DataFrame(d)

def get_book_of_pausanias(df: pd.DataFrame, book_n: int) -> pd.DataFrame:
    # The trailing "." keeps book 1 from also matching book 10
    prefix = f"urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:{book_n}."
    # .copy() returns an independent frame, so adding columns later
    # does not trigger pandas' SettingWithCopyWarning
    return df[df['urn'].str.startswith(prefix)].copy()

# Reassign pausanias_df so that it holds only book 6
pausanias_df = get_book_of_pausanias(pausanias_df, 6)

pausanias_df['whitespaced_tokens'] = pausanias_df['unannotated_strings'].str.split()

pausanias_df

Unnamed: 0,urn,raw_xml,unannotated_strings,whitespaced_tokens
1480,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",After my description of the votive offerings I...,"[After, my, description, of, the, votive, offe..."
1481,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",These I am forced to omit by the nature of my ...,"[These, I, am, forced, to, omit, by, the, natu..."
1482,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",On the right of the temple of Hera is the stat...,"[On, the, right, of, the, temple, of, Hera, is..."
1483,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",The inscription on Cleogenes the son of Silenu...,"[The, inscription, on, Cleogenes, the, son, of..."
1484,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...",After this the Eleans passed a law that in fut...,"[After, this, the, Eleans, passed, a, law, tha..."
...,...,...,...,...
1742,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","The land of Elis is fruitful, being especially...","[The, land, of, Elis, is, fruitful,, being, es..."
1743,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","Its size is twice that of the largest beetle, ...","[Its, size, is, twice, that, of, the, largest,..."
1744,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","They keep them for four years, feeding them on...","[They, keep, them, for, four, years,, feeding,..."
1745,urn:cts:greekLit:tlg0525.tlg001.perseus-eng2:6...,"<TEI xmlns=""http://www.tei-c.org/ns/1.0"" xmlns...","But I have heard that it is not the Red Sea, b...","[But, I, have, heard, that, it, is, not, the, ..."


In [19]:
sum(len(ts) for ts in pausanias_df['whitespaced_tokens'])

21034

In [20]:
# Token count with pandas: explode() yields one row per token, count() tallies them
pausanias_df['whitespaced_tokens'].explode().count()

21034

In [21]:
#types in book 6
len(pausanias_df['whitespaced_tokens'].explode().unique())

4515

In [22]:
# Counter is a class in the standard library's collections module
from collections import Counter

# Make sure the frame holds only book 6 (no effect if it was already filtered)
pausanias_df = get_book_of_pausanias(pausanias_df, 6)

# Count how many times each type occurs among the whitespace tokens
type_counts = Counter(pausanias_df['whitespaced_tokens'].explode())

# Show the 100 most frequent types with their counts
type_counts.most_common(100)

[('the', 2060),
 ('of', 1285),
 ('and', 629),
 ('a', 471),
 ('to', 427),
 ('in', 422),
 ('was', 309),
 ('is', 297),
 ('at', 269),
 ('by', 267),
 ('his', 214),
 ('that', 212),
 ('The', 187),
 ('he', 176),
 ('who', 170),
 ('for', 170),
 ('statue', 164),
 ('son', 156),
 ('from', 146),
 ('on', 137),
 ('won', 126),
 ('it', 126),
 ('are', 123),
 ('with', 120),
 ('they', 106),
 ('as', 103),
 ('made', 103),
 ('but', 85),
 ('not', 82),
 ('were', 82),
 ('this', 81),
 ('also', 78),
 ('have', 75),
 ('an', 75),
 ('their', 66),
 ('one', 64),
 ('be', 63),
 ('dedicated', 63),
 ('I', 61),
 ('him', 59),
 ('had', 58),
 ('Olympia', 53),
 ('when', 51),
 ('victory', 49),
 ('two', 47),
 ('which', 45),
 ('after', 43),
 ('them', 43),
 ('Eleans', 40),
 ('victories', 40),
 ('been', 38),
 ('inscription', 37),
 ('name', 37),
 ('There', 37),
 ('being', 36),
 ('statues', 35),
 ('all', 35),
 ('other', 35),
 ('called', 35),
 ('man', 35),
 ('because', 34),
 ('He', 33),
 ("boys'", 33),
 ('no', 31),
 ('among', 31),
 ('Ol

In [23]:
## Uncomment the line for your system's architecture
%pip install spacy
# %pip install 'spacy[apple]'
%pip install grecy

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [24]:
%run -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 22.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [25]:
import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [26]:
tokenizer = nlp.tokenizer

pausanias_df['tokens'] = pausanias_df['unannotated_strings'].apply(tokenizer)

In [27]:
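# Top 100 types in book 6, skipping stop words and non-alphabetic tokens.
# is_stop flags stop words; is_alpha is True when a token consists only of alphabetic characters.
types = [t.text for t in pausanias_df['tokens'].explode() if not t.is_stop and t.is_alpha]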

type_counts = Counter(types)

type_counts.most_common(100)

[('statue', 172),
 ('son', 161),
 ('won', 128),
 ('Olympia', 89),
 ('race', 69),
 ('Eleans', 67),
 ('boys', 65),
 ('dedicated', 64),
 ('Elis', 63),
 ('victory', 56),
 ('men', 53),
 ('boxing', 53),
 ('victories', 52),
 ('match', 52),
 ('man', 45),
 ('inscription', 42),
 ('chariot', 42),
 ('called', 40),
 ('wrestling', 39),
 ('statues', 36),
 ('games', 35),
 ('time', 33),
 ('place', 32),
 ('said', 32),
 ('Olympic', 30),
 ('set', 29),
 ('pancratium', 29),
 ('stands', 29),
 ('horses', 28),
 ('sanctuary', 28),
 ('people', 27),
 ('work', 25),
 ('Elean', 24),
 ('horse', 24),
 ('pentathlum', 24),
 ('crown', 23),
 ('father', 23),
 ('foot', 23),
 ('proclaimed', 22),
 ('victor', 22),
 ('image', 22),
 ('land', 22),
 ('river', 22),
 ('Sicyon', 21),
 ('came', 21),
 ('temple', 20),
 ('native', 20),
 ('Pytho', 20),
 ('Nemea', 20),
 ('old', 20),
 ('city', 20),
 ('boy', 19),
 ('bronze', 19),
 ('received', 18),
 ('Festival', 18),
 ('Heracles', 18),
 ('day', 18),
 ('brought', 17),
 ('Arcadians', 17),
 ('E

In [28]:
raw_texts = [t for t in pausanias_df['unannotated_strings']]
annotated_texts = nlp.pipe(raw_texts, batch_size=100)

pausanias_df['nlp_docs'] = list(annotated_texts)

In [29]:
lemmata = [t.lemma_ for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lemmata_counts = Counter(lemmata)

lemmata_counts.most_common(100)

[('statue', 209),
 ('son', 176),
 ('win', 145),
 ('victory', 109),
 ('man', 100),
 ('Olympia', 89),
 ('boy', 84),
 ('race', 76),
 ('Eleans', 67),
 ('dedicate', 65),
 ('Elis', 63),
 ('match', 54),
 ('horse', 52),
 ('stand', 50),
 ('say', 48),
 ('chariot', 46),
 ('inscription', 45),
 ('time', 42),
 ('boxing', 41),
 ('call', 41),
 ('place', 37),
 ('come', 36),
 ('game', 35),
 ('crown', 34),
 ('set', 33),
 ('hold', 33),
 ('wrestling', 33),
 ('work', 31),
 ('foot', 30),
 ('victor', 29),
 ('pancratium', 29),
 ('image', 29),
 ('city', 29),
 ('sanctuary', 29),
 ('people', 28),
 ('native', 26),
 ('know', 24),
 ('pentathlum', 24),
 ('proclaim', 23),
 ('father', 23),
 ('land', 23),
 ('receive', 22),
 ('god', 22),
 ('river', 22),
 ('treasury', 22),
 ('olympic', 21),
 ('Sicyon', 21),
 ('bring', 21),
 ('old', 21),
 ('great', 21),
 ('athlete', 20),
 ('temple', 20),
 ('Pytho', 20),
 ('Nemea', 20),
 ('enter', 20),
 ('year', 20),
 ('take', 19),
 ('bronze', 19),
 ('day', 19),
 ('umpire', 18),
 ('near', 1

In [30]:
lexemes = [(t.text, t.lex) for t in pausanias_df['nlp_docs'].explode() if not t.is_stop and t.is_alpha]

lexeme_counts = Counter(lexemes)

lexeme_counts.most_common(100)

[(('statue', <spacy.lexeme.Lexeme at 0x11d4afd40>), 172),
 (('son', <spacy.lexeme.Lexeme at 0x11d4dc900>), 161),
 (('won', <spacy.lexeme.Lexeme at 0x11d4d6c80>), 128),
 (('Olympia', <spacy.lexeme.Lexeme at 0x11dfae0c0>), 89),
 (('race', <spacy.lexeme.Lexeme at 0x11d4bbf80>), 69),
 (('Eleans', <spacy.lexeme.Lexeme at 0x11d4a1640>), 67),
 (('boys', <spacy.lexeme.Lexeme at 0x11d4de800>), 65),
 (('dedicated', <spacy.lexeme.Lexeme at 0x11d433fc0>), 64),
 (('Elis', <spacy.lexeme.Lexeme at 0x11e4d1d80>), 63),
 (('victory', <spacy.lexeme.Lexeme at 0x11d4ddf80>), 56),
 (('men', <spacy.lexeme.Lexeme at 0x11d4d9200>), 53),
 (('boxing', <spacy.lexeme.Lexeme at 0x11d4deb80>), 53),
 (('victories', <spacy.lexeme.Lexeme at 0x11d4d7840>), 52),
 (('match', <spacy.lexeme.Lexeme at 0x11d4dedc0>), 52),
 (('man', <spacy.lexeme.Lexeme at 0x11e7a2cc0>), 45),
 (('inscription', <spacy.lexeme.Lexeme at 0x11d4dff40>), 42),
 (('chariot', <spacy.lexeme.Lexeme at 0x11d4a1e00>), 42),
 (('called', <spacy.lexeme.Lexeme

> Book 6: In book 6, there are 21034 tokens and 4515 types. The most common types are "the", "of", "and", "a", and "to", which makes sense as they are stop words. After filtering out stop words, the most common types are "statue", "son", "won", "Olympia", and "race". The most common lemmata are "statue", "son", "win", "victory", and "man". It is interesting that words for winning ("win" and "victory") rise to the top of the lemmata. The top five lexemes are the same as the top five types.

> HW Paragraph: Analyzing books 4, 5, and 6 separately let me see each one more clearly. Book 4 has the most tokens and types, at 26395 tokens and 5310 types, which tells me that it is longer than books 5 and 6. In all three books, stop words are the most common types, which makes sense. From the most common types after filtering, I can deduce that book 4 is about Messenia; I assume that book 5 focuses on the temple of Zeus and ancient Elis, and that book 6 centers on the site of Olympia. The difference between types and lexemes was not apparent in this exercise, which makes sense because a spaCy lexeme is tied to a word's surface form, so counting lexemes reproduces the type counts. By analyzing the most common types, lemmata, and lexemes of each book separately, I was able to deduce what each one is about.
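
One rough way to compare the three books beyond raw totals is the type-token ratio (types divided by tokens), which is not computed in the notebook itself. The sketch below uses the whitespace-based counts reported above. The ratio depends on text length, since longer texts tend to repeat more words, so it should be read as a rough indicator only:

```python
# (tokens, types) per book, copied from the whitespace counts above.
counts = {4: (26395, 5310), 5: (22840, 5084), 6: (21034, 4515)}

# Type-token ratio: distinct types divided by total tokens.
ttr = {book: types / tokens for book, (tokens, types) in counts.items()}

for book, ratio in sorted(ttr.items()):
    print(f"Book {book}: {ratio:.3f}")
```

By this measure book 4, the longest of the three, has the lowest ratio, which is consistent with the length effect just mentioned.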

> HW #3
For my homework this week, one of the books I analyzed was Book 4 of Pausanias' "Description of Greece," in English translation. Book 4 contains 26395 tokens and 5310 types. The most common types include "the", "of", "and", and "to", which is not surprising, since these are stop words that appear very frequently in English. After excluding stop words, I found that the most common types are "Messenians", "Lacedaemonians", "son", "Aristomenes", and "men". These types suggest that Book 4 is about Messenia and Sparta, and that it focuses heavily on Aristomenes, the Messenian hero. The most common lemmata are very similar to the most common types: "Messenians", "Lacedaemonians", "son", "man", and "come". "Come" is the only new entry, since "man" and "men" share a lemma, and it adds little information because it carries little meaning on its own. The five most common lexemes in Book 4 are the same as the five most common types, which makes sense because in spaCy a lexeme has no part-of-speech tag, dependency parse, or lemma. This quantitative analysis aligns with what I know of Book 4, which focuses on the history of Messenia, especially its wars: "war" is the sixth most common type, and "battle", "death", and "land" are also among the most common types.
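
A claim such as "war is the sixth most common type" can be checked directly from a `Counter`. The sketch below uses the six highest counts copied from the filtered book 4 output above; `rank_of` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter

# The six most frequent non-stop-word types in book 4, copied from the output above.
type_counts = Counter({
    "Messenians": 192, "Lacedaemonians": 143, "son": 120,
    "Aristomenes": 82, "men": 81, "war": 66,
})


def rank_of(word: str, counts: Counter) -> int:
    """Hypothetical helper: 1-based position of `word` in most_common() order."""
    ordered = [w for w, _ in counts.most_common()]
    return ordered.index(word) + 1


print(rank_of("war", type_counts))
```

With the full `type_counts` from the notebook, the same helper would work for any word that occurs in the book; it raises `ValueError` for a word that is absent.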