<div class="well" style="margin:1em 2em">
<p>This Notebook reproduces and expands on a demo from “Distant Reading of Direct Speech in Epic: An Illustrated Workflow,” a talk I gave at the FIEC / CA annual meeting in London, July 8, 2019.</p>
</div>


# Heroes and their moms

Let's say we're young scholars interested in Telemachus' speech to Penelope.
 - How often does he speak to her?
 - What kind of language does he use?
 - How does the narrator refer to these speeches?
 
We'll start by showing how the DICES database and Python library can be used to retrieve and manipulate the speeches in question. Then we'll expand our perspective to show how DICES enables research on a "distant reading" scale, taking in all heroes and their mothers. Finally, we'll check the accuracy of the automated methods by comparing against a benchmark of hand-curated mother-child speech data.

## Preliminaries

In [1]:
# this lets me change the api while the notebook is open
%load_ext autoreload
%autoreload 2

In [4]:
import pandas as pd
import re
import ipywidgets as widgets
from IPython.display import display
from collections import Counter
from matplotlib import pyplot
%matplotlib inline

### The DICES API

See example 1 for notes.

In [6]:
from dicesapi import DicesAPI
api = DicesAPI(
    dices_api = 'http://localhost:8000/api',
    cts_api = 'http://cts.perseids.org/api/cts/',
)

### CLTK

Make sure the corpora are present:

In [7]:
from cltk.corpus.utils.importer import CorpusImporter
corpora = [
    '{}_models_cltk',
    '{}_text_perseus',
    '{}_treebank_perseus',
    '{}_lexica_perseus',
]

print('Importing corpora:')

for lang in ['latin', 'greek']:
    downloader = CorpusImporter(lang)
    for corpus in corpora:
        print(" - " + corpus.format(lang))
        downloader.import_corpus(corpus.format(lang))

from cltk.tokenize.word import WordTokenizer
tokenizer = {
    'greek': WordTokenizer('greek'),
    'latin': WordTokenizer('latin'),
}

Importing corpora:
 - latin_models_cltk
 - latin_text_perseus
 - latin_treebank_perseus
 - latin_lexica_perseus
 - greek_models_cltk
 - greek_text_perseus
 - greek_treebank_perseus
 - greek_lexica_perseus


Set up lemmatizers:

In [8]:
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer
lemmatizer = {
    'greek': BackoffGreekLemmatizer(),
    'latin': BackoffLatinLemmatizer(),    
}

# regular expressions to tidy up perseus texts for cltk
replacements = {
    'greek': [
        (r'·', ','),           # FIXME: raised dot? 
        (chr(700), chr(8217)), # two different apostrophes that look alike
    ],
    'latin': [
        
    ],
}

# compile the regexes
for lang in ['greek', 'latin']:
    replacements[lang] = [(re.compile(pat), repl) for pat, repl in replacements[lang]]
    

# generic tokenize-lemmatize function
def lemmatize(text, lang):
    '''return a list of (token, lemma) pairs for a string'''
    
    for pat, repl in replacements[lang]:
        text = pat.sub(repl, text)
    
    tokens = tokenizer[lang].tokenize(text)
    lemmata = lemmatizer[lang].lemmatize(tokens)
    
    return lemmata
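
To see what the `replacements` step does on its own, before any tokenizing happens, here is a minimal standalone sketch using only the standard library. The sample string is made up for illustration, and handling both U+00B7 and U+0387 as the "raised dot" is my assumption; the notebook's rule list only shows a single dot character.

```python
import re

# Illustrative rules modelled on the Greek replacements above.
# Treating both U+00B7 and U+0387 as the raised dot is an assumption.
RULES = [
    (re.compile('[\u00b7\u0387]'), ','),   # raised dot -> comma
    (re.compile(chr(700)), chr(8217)),     # U+02BC -> U+2019 (look-alike apostrophes)
]

def normalize(text, rules=RULES):
    """Apply each (pattern, replacement) pair in order."""
    for pat, repl in rules:
        text = pat.sub(repl, text)
    return text

# Made-up sample containing one raised dot and one U+02BC apostrophe
sample = 'ἀείδειν\u00b7 τὴν δ\u02bc'
cleaned = normalize(sample)
print(cleaned)
```

Because the rules are applied in order, a later rule sees the output of an earlier one, which matters as soon as two patterns can overlap.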

### WikiData

In [9]:
from qwikidata.linked_data_interface import get_entity_dict_from_api
from qwikidata.entity import WikidataItem, WikidataProperty

##  Part 1

Let's start by building a lexicon for all the words Telemachus speaks to Penelope.

### Identify and download the speeches

Using the hand-rolled DICES API code, we can search speeches using keywords. For now, JSON results from the API are paged, so if your search has a lot of results, you may have to wait for several pages to download. I've added a progress bar widget because I get impatient.

Note that I can specify both the speaker and the addressee.

In [15]:
speeches = api.getSpeeches(spkr_name='Telemachus', addr_name='Penelope')
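
For readers curious what "paged" means in practice, here is a rough sketch of the general pattern. It assumes each page is a dict with a `results` list and a `next` URL (or `None` on the last page), which is a common REST convention but is not confirmed for DICES; `fetch` and `FAKE_PAGES` are hypothetical stand-ins for the network.

```python
def iter_paged(first_url, fetch):
    """Yield every item from a paged JSON API.

    `fetch` is any callable mapping a URL to a decoded JSON dict. The
    'results'/'next' keys are an assumed page layout, not confirmed for DICES.
    """
    url = first_url
    while url is not None:
        page = fetch(url)
        yield from page['results']
        url = page.get('next')

# Stand-in for the network: two fake pages keyed by URL
FAKE_PAGES = {
    'page1': {'results': ['speech A', 'speech B'], 'next': 'page2'},
    'page2': {'results': ['speech C'], 'next': None},
}

items = list(iter_paged('page1', FAKE_PAGES.__getitem__))
print(items)
```

Writing it as a generator means a progress bar can be updated after each item without waiting for the whole download.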

What did we get?

In [16]:
for s in speeches:
    print(s)

<Speech: Homer Odyssey 1.346-1.359>
<Speech: Homer Odyssey 17.46-17.56>
<Speech: Homer Odyssey 17.108-17.149>
<Speech: Homer Odyssey 18.227-18.242>
<Speech: Homer Odyssey 21.344-21.353>
<Speech: Homer Odyssey 23.97-23.103>
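
Each speech prints with a locus range such as `1.346-1.359`. As a small worked example, here is one way to turn those ranges into line counts. This is only a sketch: `parse_locus` and `speech_length` are names I made up, and the code assumes plain `book.line` references with both ends in the same book.

```python
def parse_locus(locus):
    """Turn 'book.line' into a (book, line) tuple of ints."""
    book, line = locus.split('.')
    return int(book), int(line)

def speech_length(l_range):
    """Number of verse lines in a range such as '1.346-1.359'.

    Assumes both ends are in the same book; raises ValueError otherwise.
    """
    first, last = (parse_locus(part) for part in l_range.split('-'))
    if first[0] != last[0]:
        raise ValueError(f'range spans more than one book: {l_range}')
    return last[1] - first[1] + 1

# The six ranges printed above
ranges = ['1.346-1.359', '17.46-17.56', '17.108-17.149',
          '18.227-18.242', '21.344-21.353', '23.97-23.103']
lengths = {r: speech_length(r) for r in ranges}
total = sum(lengths.values())
print(lengths)
print(total)
```

By this count the six speeches add up to 100 lines, with the long report in Book 17 accounting for 42 of them.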


### Retrieve the passages from a remote library

We have the metadata for each speech; now we need the text. The DICES library uses MyCapytain under the hood to retrieve the passages from a remote CTS server: here, Perseus, specified in the `DicesAPI()` call above.

In [17]:
passages = []
for s in speeches:
    cts_passage = s.getCTS()
    text = cts_passage.text
    passages.append(text)
    
    print(f'{s.author.name} {s.work.title} {s.l_range}')
    print(text)
    print()

Homer Odyssey 1.346-1.359
μῆτερ ἐμή, τί τʼ ἄρα φθονέεις ἐρίηρον ἀοιδὸν τέρπειν ὅππῃ οἱ νόος ὄρνυται; οὔ νύ τʼ ἀοιδοὶ αἴτιοι, ἀλλά ποθι Ζεὺς αἴτιος, ὅς τε δίδωσιν ἀνδράσιν ἀλφηστῇσιν, ὅπως ἐθέλῃσιν, ἑκάστῳ. τούτῳ δʼ οὐ νέμεσις Δαναῶν κακὸν οἶτον ἀείδειν· τὴν γὰρ ἀοιδὴν μᾶλλον ἐπικλείουσʼ ἄνθρωποι, ἥ τις ἀκουόντεσσι νεωτάτη ἀμφιπέληται. σοὶ δʼ ἐπιτολμάτω κραδίη καὶ θυμὸς ἀκούειν· οὐ γὰρ Ὀδυσσεὺς οἶος ἀπώλεσε νόστιμον ἦμαρ ἐν Τροίῃ, πολλοὶ δὲ καὶ ἄλλοι φῶτες ὄλοντο. ἀλλʼ εἰς οἶκον ἰοῦσα τὰ σʼ αὐτῆς ἔργα κόμιζε, ἱστόν τʼ ἠλακάτην τε, καὶ ἀμφιπόλοισι κέλευε ἔργον ἐποίχεσθαι· μῦθος δʼ ἄνδρεσσι μελήσει πᾶσι, μάλιστα δʼ ἐμοί· τοῦ γὰρ κράτος ἔστʼ ἐνὶ οἴκῳ.

Homer Odyssey 17.46-17.56
μῆτερ ἐμή, μή μοι γόον ὄρνυθι μηδέ μοι ἦτορ ἐν στήθεσσιν ὄρινε φυγόντι περ αἰπὺν ὄλεθρον· ἀλλʼ ὑδρηναμένη, καθαρὰ χροῒ εἵμαθʼ ἑλοῦσα, εἰς ὑπερῷʼ ἀναβᾶσα σὺν ἀμφιπόλοισι γυναιξὶν εὔχεο πᾶσι θεοῖσι τεληέσσας ἑκατόμβας ῥέξειν, αἴ κέ ποθι Ζεὺς ἄντιτα ἔργα τελέσσῃ. αὐτὰρ ἐγὼν ἀγορὴν ἐσελεύσομαι, ὄφρα καλέσσω ξεῖνον, ὅτις

### Use CLTK to parse the text

We can use CLTK's tokenizers to break each string into meaningful units -- sentences and/or words. Then we use the backoff lemmatizer to normalize all the inflected forms to dictionary headwords.

I rolled these steps into one convenience function up above. 👉🏻 *One thing to watch out for is that CLTK needs to know what language you're working with, so I've added a kludge to set language based on the author name; really, that should be built into DICES eventually.*

In [18]:
lems = Counter()
for s, p in zip(speeches, passages):
    lang = s.getLang()    # language of the speech this passage came from
    lemmatized = lemmatize(p.lower(), lang)
    
    these_lems = [lem for tok, lem in lemmatized]
    lems.update(these_lems)

Convert the counter to a Pandas data frame for tidier presentation.

In [19]:
results = pd.DataFrame(lems.most_common(), columns=['lemma', 'count'])
results

Unnamed: 0,lemma,count
0,punc,121
1,ὁ,24
2,δέ,24
3,ἐγώ,24
4,καί,20
...,...,...
345,εἰκοστός,1
346,ἔτος,1
347,ἀεί,1
348,στερεός,1
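
The top row of that table is `punc`, the label the lemmatizer gives to punctuation, which isn't very interesting as vocabulary. Here is a minimal sketch of leaving it out of the count. The `(token, lemma)` pairs below are hand-written toy data, not real lemmatizer output, and `count_lemmata` is a hypothetical helper.

```python
from collections import Counter

# Hand-written toy (token, lemma) pairs; not real lemmatizer output
toy_pairs = [
    ('μῆτερ', 'μήτηρ'), ('ἐμή', 'ἐμός'), (',', 'punc'),
    ('μή', 'μή'), ('μοι', 'ἐγώ'), ('μηδέ', 'μηδέ'), ('μοι', 'ἐγώ'),
    ('.', 'punc'),
]

def count_lemmata(pairs, skip=('punc',)):
    """Count lemmata, ignoring any whose label is in `skip`."""
    return Counter(lem for tok, lem in pairs if lem not in skip)

toy_counts = count_lemmata(toy_pairs)
print(toy_counts.most_common(1))
```

The same `skip` tuple could be extended with function words if you wanted a rough stop list.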


## Part 2

Now let's think more broadly. How typical is this kind of speech? We can use external linked data to find other examples of mother-son conversations in the corpus.

### Some custom code to query WikiData

This lets us ask whether a given addressee belongs to the set of people having a certain relationship to a given speaker. It takes a while to download the WikiData entities, and I had to run this a number of times, so each entity is cached on its character object once it has been downloaded.

In [20]:
def checkWD(c):
    '''return the character's wikidata id, or None if it doesn't have one'''
    if c.char is not None and c.char.wd is not None:
        wd = c.char.wd.strip()
        if len(wd) > 0:
            return wd
    return None

def checkWDRelation(s, a, relation, cache=None):
    if cache is None:
        cache = {}
    else:
        if (s.id, a.id) in cache:
            return cache[(s.id, a.id)]

    res = False

    if not hasattr(s, 'wd_ent'):
        s.wd_ent = WikidataItem(get_entity_dict_from_api(s.wd))

    claim_group = s.wd_ent.get_truthy_claim_group(relation)

    for claim in claim_group:
        if claim.mainsnak.datavalue is None:
            continue
        if claim.mainsnak.datavalue.value['id'] == a.wd:
            res = True
    
    cache[(s.id, a.id)] = res
    return res

For example, the "mother" property has the WikiData ID `'P25'`. Here's how we ask if a given addressee is the mother of a given speaker:

In [21]:
speaker = api.getCharacters(name='Telemachus')[0]
addressee = api.getCharacters(name='Penelope')[0]

print(f'Is {addressee.name} the mother of {speaker.name}?')
print(checkWDRelation(speaker, addressee, 'P25'))

Is Penelope the mother of Telemachus?
True
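
Under the hood, a check like this just walks the entity's JSON. Here is a standalone sketch of that walk on a plain dict, with no network access. The `jason` entity is hand-built in the shape of Wikidata's JSON (it was not fetched), and `truthy_claims` and `has_relation` are my own illustrative helpers rather than part of qwikidata or DICES.

```python
def truthy_claims(entity, prop):
    """Return the best-ranked claims for `prop` from a Wikidata entity dict.

    "Truthy" here means: preferred-rank claims if any exist, otherwise
    normal-rank claims; deprecated claims are never returned.
    """
    claims = entity.get('claims', {}).get(prop, [])
    preferred = [c for c in claims if c.get('rank') == 'preferred']
    if preferred:
        return preferred
    return [c for c in claims if c.get('rank') == 'normal']

def has_relation(entity, prop, target_qid):
    """True if any truthy `prop` claim on `entity` points at `target_qid`."""
    for claim in truthy_claims(entity, prop):
        datavalue = claim['mainsnak'].get('datavalue')
        if datavalue is None:        # "no value" / "unknown value" snaks
            continue
        value = datavalue['value']
        if isinstance(value, dict) and value.get('id') == target_qid:
            return True
    return False

# Hand-built toy entity in the shape of Wikidata's JSON; not fetched
jason = {
    'id': 'Q176758',
    'claims': {
        'P25': [{
            'rank': 'normal',
            'mainsnak': {
                'snaktype': 'value',
                'property': 'P25',
                'datavalue': {'type': 'wikibase-entityid',
                              'value': {'entity-type': 'item', 'id': 'Q2718542'}},
            },
        }],
    },
}

print(has_relation(jason, 'P25', 'Q2718542'))   # Alcimede
print(has_relation(jason, 'P25', 'Q35500'))     # Aphrodite
```

Note that the question is directional: the claim lives on the child's entity and points at the mother, so the order of the two arguments matters.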


I also added a separate cache just for the boolean result of checkWDRelation, to save a little more time.

In [22]:
cache_mothers = {}

### Using WikiData to filter the speeches

The DICES dataset includes WikiData ids for most of the characters (not all). The DICES API doesn't let us query WikiData itself, though. For now, the easiest thing is just to download all the speeches and character IDs, and then cross-reference them against WikiData using its own API.

In [23]:
# download all the speeches: takes a minute
speeches = api.getSpeeches(progress=True)


**Check each speaker-addressee pair against WikiData**

What we actually do here is download the WikiData entity for each speaker, if we don't already have it cached. Then we ask the WD entity for its mom(s), and check the WD ID of the addressee against the results.

In [24]:
df = []

# create a progress bar
pbar = widgets.IntProgress(
    value = 0,
    min = 0,
    max = len(speeches),
    bar_style='info',
    orientation='horizontal'
)
pbar_label = widgets.Label(value = f'{pbar.value}/{len(speeches)}')
display(widgets.HBox([pbar, pbar_label]))

for s in speeches:
    if s.spkr is not None and s.addr is not None:
        for spkr in s.spkr:
            spkr_wd = checkWD(spkr)
            if spkr_wd is not None:

                for addr in s.addr:
                    addr_wd = checkWD(addr)
                    if addr_wd is not None:
                        df.append((
                            s.id,
                            s.work.title,
                            s.l_fi,
                            s.l_la,
                            spkr.char.name, spkr_wd, 
                            addr.char.name, addr_wd,
                            checkWDRelation(spkr.char, addr.char, 'P25', cache=cache_mothers),
                            checkWDRelation(addr.char, spkr.char, 'P25', cache=cache_mothers)
                            ))
    pbar.value += 1
    pbar_label.value = f'{pbar.value}/{len(speeches)}'

# checkWDRelation(x, y, 'P25') is True when y is x's mother, so the first
# boolean in each row means "addressee is mom" and the second "speaker is mom"
df = pd.DataFrame(df, columns=['id', 'work', 'l_first', 'l_last', 'spkr', 'sp_wd', 'addr', 'ad_wd', 'ad_is_mom', 'sp_is_mom'])

HBox(children=(IntProgress(value=0, bar_style='info', max=1858), Label(value='0/1858')))

Wikidata redirect detected.  Input entity id=Q3104159. Returned entity id=Q1108130.


🤔 Let's take a look at the results. Here is the complete set of speeches, with the additional attribute `sp_is_mom` if the speaker is the addressee's mother, and `ad_is_mom` if the addressee is the speaker's mother.

As a quick sanity check, the first two speeches in the Argonautica, which were at the top of the list when I ran this, are between Jason and his mother, Alcimede.

In [25]:
df

Unnamed: 0,id,work,l_first,l_last,spkr,sp_wd,addr,ad_wd,ad_is_mom,sp_is_mom
0,1374,Argonautica,1.278,1.291,Alcimede,Q2718542,Jason,Q176758,False,True
1,1375,Argonautica,1.295,1.305,Jason,Q176758,Alcimede,Q2718542,True,False
2,1376,Argonautica,1.332,1.340,Jason,Q176758,Argonauts,Q165510,False,False
3,1377,Argonautica,1.345,1.347,Heracles,Q122248,Argonauts,Q165510,False,False
4,1378,Argonautica,1.351,1.362,Jason,Q176758,Argonauts,Q165510,False,False
...,...,...,...,...,...,...,...,...,...,...
1624,1854,Aeneid,12.872,12.884,Juturna,Q139448,Turnus,Q633549,False,False
1625,1855,Aeneid,12.889,12.893,Aeneas,Q82732,Turnus,Q633549,False,False
1626,1856,Aeneid,12.894,12.895,Turnus,Q633549,Aeneas,Q82732,False,False
1627,1857,Aeneid,12.931,12.938,Turnus,Q633549,Aeneas,Q82732,False,False


Thanks to pandas, we can filter the data frame on the new boolean columns to show only speeches between mother and child.

In [26]:
hits = df.loc[df['sp_is_mom'] | df['ad_is_mom'],
             ['work', 'l_first', 'l_last', 'spkr', 'addr']]
hits

Unnamed: 0,work,l_first,l_last,spkr,addr
0,Argonautica,1.278,1.291,Alcimede,Jason
1,Argonautica,1.295,1.305,Jason,Alcimede
63,Argonautica,3.129,3.144,Aphrodite,Eros
64,Argonautica,3.151,3.153,Aphrodite,Eros
66,Argonautica,3.26,3.267,Chalciope,Argus (son of Phrixus)
151,Iliad,1.352,1.356,Achilles,Thetis
152,Iliad,1.362,1.363,Thetis,Achilles
153,Iliad,1.365,1.412,Achilles,Thetis
154,Iliad,1.414,1.427,Thetis,Achilles
165,Iliad,1.586,1.594,Hephaestus,Hera


Pandas also comes in handy if I want to export this data to a CSV file, which Excel can open:

In [None]:
df.to_csv('example.csv')

### Validation

Let's see how well the automated approach worked. We'll load up a hand-corrected list of mother-child speeches and compare.

In [27]:
bench = pd.read_csv('data/moms-bench.csv', dtype=str)
bench

Unnamed: 0,work,l_first,l_last,spkr,addr,notes
0,Iliad,1.352,1.356,Achilles,Thetis,
1,Iliad,1.362,1.363,Thetis,Achilles,
2,Iliad,1.365,1.412,Achilles,Thetis,
3,Iliad,1.414,1.427,Thetis,Achilles,
4,Iliad,1.586,1.594,Hephaestus,Hera,
5,Iliad,5.373,5.374,Dione,Aphrodite,
6,Iliad,5.376,5.38,Aphrodite,Dione,
7,Iliad,5.382,5.415,Dione,Aphrodite,
8,Iliad,6.254,6.262,Hecuba,Hector,
9,Iliad,6.264,6.285,Hector,Hecuba,


Let's look at the union of `hits` and `bench` to see how we did:

In [28]:
results = hits.merge(bench, on=['work', 'l_first'], how='outer', 
                        suffixes=['_h', '_b'], indicator=True)
results[['work', 'l_first', 'spkr_h', 'addr_h', 'spkr_b', 'addr_b', '_merge']]

Unnamed: 0,work,l_first,spkr_h,addr_h,spkr_b,addr_b,_merge
0,Argonautica,1.278,Alcimede,Jason,Alcimede,Jason,both
1,Argonautica,1.295,Jason,Alcimede,Jason,Alcimede,both
2,Argonautica,3.129,Aphrodite,Eros,Aphrodite,Eros,both
3,Argonautica,3.151,Aphrodite,Eros,Aphrodite,Eros,both
4,Argonautica,3.26,Chalciope,Argus (son of Phrixus),Chalciope,Argus,both
5,Iliad,1.352,Achilles,Thetis,Achilles,Thetis,both
6,Iliad,1.362,Thetis,Achilles,Thetis,Achilles,both
7,Iliad,1.365,Achilles,Thetis,Achilles,Thetis,both
8,Iliad,1.414,Thetis,Achilles,Thetis,Achilles,both
9,Iliad,1.586,Hephaestus,Hera,Hephaestus,Hera,both
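
If the `_merge` column looks mysterious, here is the same idea spelled out with plain sets: every key gets labelled by which side it came from. This is only a sketch of the concept, not how pandas implements it; `merge_indicator` is a made-up helper, and the three keys are taken from the tables in this notebook.

```python
def merge_indicator(left_keys, right_keys):
    """Label every key as 'both', 'left_only' or 'right_only'."""
    left, right = set(left_keys), set(right_keys)
    labels = {}
    for key in sorted(left | right):
        if key in left and key in right:
            labels[key] = 'both'
        elif key in left:
            labels[key] = 'left_only'
        else:
            labels[key] = 'right_only'
    return labels

# (work, l_first) keys; the left side plays the role of hits, the right of bench
hits_keys = [('Iliad', '1.352'), ('Iliad', '1.362')]
bench_keys = [('Iliad', '1.352'), ('Iliad', '1.362'), ('Aeneid', '1.387')]

labels = merge_indicator(hits_keys, bench_keys)
for key, label in labels.items():
    print(key, label)
```

One thing this makes obvious: the keys are compared as strings, so a locus has to be written identically in both tables to count as a match.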


#### Precision and Recall

In [None]:
true_pos = sum(results['_merge'] == 'both')

p = true_pos / hits.shape[0]
r = true_pos / bench.shape[0]

print(f'Precision: {p:.2f}')
print(f'Recall:    {r:.2f}')
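
As a reminder of what these two numbers mean: precision is the share of what we retrieved that is actually relevant, and recall is the share of the relevant items that we managed to retrieve. Here is a tiny worked example on toy sets (the letters are placeholders, not real speeches), with a hypothetical helper `precision_recall`.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of keys."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

# Toy data: everything retrieved is relevant, but one relevant item is missed
toy_retrieved = {'a', 'b', 'c', 'd'}
toy_relevant = {'a', 'b', 'c', 'd', 'e'}

toy_p, toy_r = precision_recall(toy_retrieved, toy_relevant)
print(f'Precision: {toy_p:.2f}')
print(f'Recall:    {toy_r:.2f}')
```

With no false positives the precision is 1.00, and missing one relevant item out of five gives a recall of 0.80.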

### Discussion

#### The good news
Well, we returned no false positives, and managed to catch 80% of the benchmark set.

#### The bad news
Let's look a little more closely at the speeches we missed:

In [30]:
missed = results[results['_merge'] == 'right_only'][
                ['work', 'l_first', 'spkr_h', 'addr_h', 'spkr_b', 'addr_b', 'notes']]
missed

Unnamed: 0,work,l_first,spkr_h,addr_h,spkr_b,addr_b,notes
45,Iliad,15.104,,,Hera,gods,several of her children among the gods
46,Iliad,15.115,,,Ares,gods,direct response to earlier speech of his mother
47,Aeneid,1.321,,,Venus (in disguise),Aeneas and Achates,= iuvenes; mother disguised and son is one of ...
48,Aeneid,1.326,,,Aeneas,Venus,in disguise => virgo
49,Aeneid,1.335,,,Venus (in disguise),Aeneas,and Achates: => pl. but primarily Aeneas
50,Aeneid,1.372,,,Aeneas,Venus,"in disguise: is addressed as dea, but denies i..."
51,Aeneid,1.387,,,Venus,Aeneas,
52,Aeneid,1.407,,,Aeneas,Venus,
53,Aeneid,2.594,,,Venus,Aeneas,
54,Aeneid,6.194,,,Aeneas,two doves and Venus,


At a glance, I'd say these fall into two groups:

 1. A conversation in the Iliad between Hera and a group of gods, some of whom were her children
 2. Conversations in the Aeneid between Venus and Aeneas
 
Missing the first group seems somewhat understandable, if the addressee wasn't explicitly named. Missing the second group, on the other hand, is a big problem!

#### Digging a little deeper

First, let's confirm that all these speeches are in the database results.

In [31]:
missed.merge(df, how='left', on=['work', 'l_first'])[[
    'work', 'l_first',                               # keys: work and locus
    'id', 'spkr', 'addr', 'sp_is_mom', 'ad_is_mom',  # cols from df
    'spkr_b', 'addr_b'                               # cols from bench
    
]].groupby('id').agg(lambda x: ', '.join(x.unique())) # for speeches with multiple speakers/
                                                      # addressees: combine on speech id

id,work,l_first,spkr,addr,spkr_b,addr_b
391,Iliad,15.104,Hera,gods,Hera,gods
392,Iliad,15.115,Ares,gods,Ares,gods
1523,Aeneid,1.321,Venus,"Achates, Aeneas",Venus (in disguise),Aeneas and Achates
1524,Aeneid,1.326,Aeneas,Venus,Aeneas,Venus
1525,Aeneid,1.335,Venus,"Achates, Aeneas",Venus (in disguise),Aeneas
1526,Aeneid,1.372,Aeneas,Venus,Aeneas,Venus
1527,Aeneid,1.387,Venus,Aeneas,Venus,Aeneas
1528,Aeneid,1.407,Aeneas,Venus,Aeneas,Venus
1557,Aeneid,2.594,Venus,Aeneas,Venus,Aeneas
1653,Aeneid,6.194,Aeneas,Venus,Aeneas,two doves and Venus
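
The `groupby('id').agg(...)` step is what turns two rows for speech 1523 (one per addressee) into a single row reading "Achates, Aeneas". Here is a standard-library sketch of the same collapsing logic. `combine_by_id` is a made-up helper, and the rows are a cut-down version of the table above.

```python
from collections import defaultdict

def combine_by_id(rows):
    """Collapse rows sharing a speech id, joining distinct values per column."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col, value in row.items():
            if col == 'id':
                continue
            seen = grouped[row['id']][col]
            if value not in seen:    # keep first-seen order, drop repeats
                seen.append(value)
    return {sid: {col: ', '.join(vals) for col, vals in cols.items()}
            for sid, cols in grouped.items()}

# One row per speaker-addressee pair, as in df
rows = [
    {'id': 1523, 'spkr': 'Venus', 'addr': 'Achates'},
    {'id': 1523, 'spkr': 'Venus', 'addr': 'Aeneas'},
    {'id': 1524, 'spkr': 'Aeneas', 'addr': 'Venus'},
]

combined = combine_by_id(rows)
print(combined[1523])
```

Values that repeat within a speech (here the speaker, Venus) appear only once in the combined row.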


Yep, all the missed speeches were in the database. The ones involving Hera's children as a collective 'gods' clearly would be difficult to match with this method.

But what happened with Aeneas and Venus?? WikiData must not treat Venus as Aeneas' mother...?

In [None]:
speaker = api.getCharacters(name='Aeneas')[0]
addressee = api.getCharacters(name='Venus')[0]

print(f'Is {addressee.name} the mother of {speaker.name}?')
print(checkWDRelation(speaker, addressee, 'P25'))

So after some head scratching, the problem turns out to be that WikiData has separate entries for [Venus](https://www.wikidata.org/wiki/Q47652) and [Aphrodite](https://www.wikidata.org/wiki/Q35500). Only the latter is listed as a mother of [Aeneas](https://www.wikidata.org/wiki/Q82732).

In [None]:
speaker = api.getCharacters(name='Aeneas')[0]
addressee = api.getCharacters(name='Aphrodite')[0]

print(f'Is {addressee.name} the mother of {speaker.name}?')
print(checkWDRelation(speaker, addressee, 'P25'))
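
One possible workaround, sketched here purely as an illustration, would be to keep a small hand-made table of WikiData ids that we have decided to treat as the same character, and consult it when checking a relation. Nothing like this exists in DICES as far as this notebook shows; `SAME_AS`, `equivalent_ids` and `is_mother` are hypothetical names, and whether Venus and Aphrodite *should* be merged is exactly the kind of judgment call a domain expert needs to make.

```python
# Hypothetical equivalence table: Q-ids we choose to treat as one character
SAME_AS = [
    {'Q47652', 'Q35500'},   # Venus, Aphrodite
]

def equivalent_ids(qid, groups=SAME_AS):
    """Return every Q-id treated as the same character as `qid`."""
    for group in groups:
        if qid in group:
            return set(group)
    return {qid}

def is_mother(mother_ids_of_child, candidate_qid):
    """True if the candidate, or any id we equate with it, is a listed mother."""
    return bool(equivalent_ids(candidate_qid) & set(mother_ids_of_child))

# Per the discussion above, only Aphrodite (Q35500) is listed as Aeneas' mother
aeneas_mothers = {'Q35500'}

print(is_mother(aeneas_mothers, 'Q47652'))    # asking about Venus
print(is_mother(aeneas_mothers, 'Q176758'))   # asking about Jason
```

The table makes the editorial decision explicit and easy to revisit, instead of burying it in the matching code.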

### Takeaways

 - Part of the fault here is ours: the database underlying DICES betrays its diverse origins by a lingering heterogeneity. If we think Aphrodite and Venus are the same character, we'd better make sure we refer to her the same way consistently, or risk weird results.
 
 - WikiData gave us a lot for free -- all of the individual mother-child relationships were in there when we knew where to look.
 
 - But missing Venus was a pretty big "gotcha" in the end.
 
 - If we want to rely on linked open data for high-stakes work, we need resources that are sensitive to the details we care about. We hope that MANTO, because it's specific to Classical myth and hand-curated by domain experts, can help us with problems like when to treat Venus and Aphrodite as independent entities and when to consider them identical.