In [1]:
from pathlib import Path
import sys
sys.path.insert(0, "../scripts")

temp_folder = Path("../intermediate/dead-languages-corpus-20221117")

This notebook assumes you have already extracted the raw data to an intermediate location. If you have not, feel free to uncomment and run the following snippet.

In [2]:
# from convert import extract
# zip_path = Path("../raw/dead-languages-corpus-20221117.zip")
# extract(zip_path=zip_path, temp_folder=temp_folder)

In [3]:
from corpus import load_corpus

extracted_files = list(temp_folder.iterdir())
corpus = load_corpus(extracted_files)

Parsing language
Parsing head_word
Parsing grammar
Parsing gloss
Parsing lesson
Parsing series
Parsing glossed_text
Parsing element


In [4]:
import re

# First I want to look at the glossed texts that start with an HTML tag _other than_ <br>

# This matches an opening angle bracket that isn't followed by "br".
# Note that Pattern.match() only tests the start of the string, so tags further in are not reported here.
html_not_br = re.compile(r"<(?!br)")

# print lines matching that description
for id, row in corpus.glossed_text.items():
    if html_not_br.match(row.glossed_text):
        print(f"{id}: {row.glossed_text}")

9171: <font size="-1">[II, 42]</font> - kánikradaj janúṣam prabruvāṇá <br/>
íyarti vā́cam aritéva nā́vam <br/>
sumaṅgálaś ca śakune bhávāsi <br/>
mā́ tvā kā́ cid abhibhā́ víśvyā vidat <br/><br/>
9174: <font size="-1">[X, 58]</font> - yát te yamáṃ vaivasvatám <br/>
máno jagā́ma dūrakám <br/>
tát ta ā́ vartayāmasi <br/>
ihá kṣáyāya jīváse <br/><br/>
9310: <font size="-1">1</font> - Un kad Jesus dsimmis bij Betlemē eeksch Juddo Semmes, tha Ķehniņa Erodus Laikā, redsi, tad nahze Gudree no Austruma Semmes us Jerusalemi, un sazzija:
9311: <font size="-1">2</font> - Kur irr tas peedsimmis Ķehniņsch tho Juddo? jo mehs essam viņņa Swaigsni redsejschi Austruma Semmē, un nahkuschi to peeluhgt.
9312: <font size="-1">3</font> - Kad tas Ķehniņsch Erodus to dsirdeja, issabijajahs viņsch, un vissa Jerusaleme ar viņņa.
9313: <font size="-1">4</font> - Un saaizinajs vissus Augstus Preesteŗus un Raksta Mahzitajus starp teem Ļaudim isklausija viņsch no teem: Kur Kristum bij dsimt?
9314: <font size="-1">5<
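
A small aside on the loop above: `match` only tries the pattern at position 0, while `search` scans the whole string. The sketch below illustrates the difference on three hand-written sample strings (assumptions modelled on the corpus output, not loaded from it).

```python
import re

html_not_br = re.compile(r"<(?!br)")

# Hypothetical sample strings, modelled on rows seen in the corpus output
samples = {
    "leading": '<font size="-1">1</font> - Un kad Jesus dsimmis',
    "inner": 'Qui positus forte. [<i lang="en">Sentences omitted at this point.</i>]',
    "only_br": "tát ta ā́ vartayāmasi <br/>",
}

for label, text in samples.items():
    at_start = html_not_br.match(text) is not None   # anchored at the start
    anywhere = html_not_br.search(text) is not None  # scans the whole string
    print(f"{label}: match={at_start} search={anywhere}")
```

Only the `leading` sample is reported by `match`; `search` would also report `inner`, and neither reports a text whose only tag is `<br/>`.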

In [12]:
# That's a lot. Many of these texts begin with a line-number marker. Let's see if we can capture those.

# This captures a <font>...</font> or <i>...</i> element at the very beginning of the text,
# followed by a space and, optionally, "- "
line_number = re.compile(r"^<((font[^<]+</font>)|(i[^<]+</i>)) (- )?")

# print lines matching that description
for id, row in corpus.glossed_text.items():
    if line_number.match(row.glossed_text):
        print(f"{id}: {row.glossed_text}")

9171: <font size="-1">[II, 42]</font> - kánikradaj janúṣam prabruvāṇá <br/>
íyarti vā́cam aritéva nā́vam <br/>
sumaṅgálaś ca śakune bhávāsi <br/>
mā́ tvā kā́ cid abhibhā́ víśvyā vidat <br/><br/>
9174: <font size="-1">[X, 58]</font> - yát te yamáṃ vaivasvatám <br/>
máno jagā́ma dūrakám <br/>
tát ta ā́ vartayāmasi <br/>
ihá kṣáyāya jīváse <br/><br/>
9310: <font size="-1">1</font> - Un kad Jesus dsimmis bij Betlemē eeksch Juddo Semmes, tha Ķehniņa Erodus Laikā, redsi, tad nahze Gudree no Austruma Semmes us Jerusalemi, un sazzija:
9311: <font size="-1">2</font> - Kur irr tas peedsimmis Ķehniņsch tho Juddo? jo mehs essam viņņa Swaigsni redsejschi Austruma Semmē, un nahkuschi to peeluhgt.
9312: <font size="-1">3</font> - Kad tas Ķehniņsch Erodus to dsirdeja, issabijajahs viņsch, un vissa Jerusaleme ar viņņa.
9313: <font size="-1">4</font> - Un saaizinajs vissus Augstus Preesteŗus un Raksta Mahzitajus starp teem Ļaudim isklausija viņsch no teem: Kur Kristum bij dsimt?
9314: <font size="-1">5<
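
To see what the `line_number` pattern actually removes, here is a minimal sketch that applies it to three strings. The first two are taken from rows printed in this notebook; the third is a hypothetical text that begins with an italic word, included only to show that the `<i>` branch would strip it too. Whether any corpus row really starts that way is not something the output here confirms.

```python
import re

line_number = re.compile(r"^<((font[^<]+</font>)|(i[^<]+</i>)) (- )?")

samples = [
    '<font size="-1">[II, 42]</font> - kánikradaj janúṣam prabruvāṇá <br/>',
    '<font size="-1">16</font> guardai in alto, e vidi le sue spalle <br>',
    "<i>QÍ-BÍ-MA</i> LUGAL",  # hypothetical: a text-initial italic word
]

# Remove the leading marker (if any) from each sample
stripped = [line_number.sub("", s) for s in samples]
for before, after in zip(samples, stripped):
    print(f"{before!r}\n  -> {after!r}")
```

If a leading italic word turns out to be real content rather than a marker, narrowing the `<i>` branch would be worth considering.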

In [18]:
# Perfect. Let's get rid of those, and the <br/> tags, and see what's left.
br = re.compile("<br ?/?> ?")

cleaned = {}
for id, row in corpus.glossed_text.items():
    text = row.glossed_text
    text = line_number.sub("", text)
    text = br.sub("", text)
    cleaned[id] = text

# Show a before/after for the last row processed (`id` still holds the last key from the loop)
print(f"{id}: {corpus.glossed_text[id].glossed_text}")
print("---becomes---")
print(f"{id}: {cleaned[id]}")

11349: <font size="-1">16</font> guardai in alto, e vidi le sue spalle <br> vestite già de’ raggi del pianeta <br> che mena dritto altrui per ogne calle. <br><br>
---becomes---
11349: guardai in alto, e vidi le sue spalle vestite già de’ raggi del pianeta che mena dritto altrui per ogne calle. 


In [19]:
# What HTML tags are left?
for id, text in cleaned.items():
    if "<" in text:
        print(f"{id}: {text}")

8817: Qui positus forte in statione pontis. [<i lang="en">Sentences omitted at this point.</i>]
8846: Atque ob eam causam, qui sunt adfecti gravioribus morbis quique in proeliis periculisque versantur, aut pro victimis homines immolant aut se immolaturos vovent, administrisque ad ea sacrificia druidibus utuntur. [<i lang="en">Section omitted at this point.</i>]
9531: <sup>M</sup>A-ni-it-ta DUMU <sup>M</sup>Pi-it-ha-a-na LUGAL <sup>URU</sup>Ku-us-sa-ra <i>QÍ-BÍ-MA</i> 
9532: ne-pi-is-za-as-ta <sup>D</sup>IŠKUR-un-ni a-as-su-us e-es-ta 
9533: na-as-ta <sup>D</sup>IŠKUR-un-ni-ma ma-a-an a-as-su-us e-es-ta <sup>URU</sup>Ne-e-sa-as LUGAL-us <sup>URU</sup>Ku-us-sa-ra-as LUGAL-i ... 
9534: LUGAL <sup>URU</sup>Ku-us-sa-ra URU-az kat-ta pa-an-ga-ri-it ú-e-et nu <sup>URU</sup>Ne-e-sa-an is-pa-an-di na-ak-ki-it da-a-as 
9535: <sup>URU</sup>Ne-e-sa-as LUGAL-un <i>IṢ-BAT</i> <i>Ù</i> DUMU<sup>MEŠ</sup> <sup>URU</sup>Ne-e-sa-as i-da-a-lu na-at-ta ku-e-da-ni-ik-ki tak-ki-is-ta 
9537: nu <sup>M</sup>P
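
Printing every remaining text gets long quickly. As an alternative, a tally of tag names gives a compact overview. This is a sketch only: `count_tags` and the `tag_name` pattern are hypothetical helpers, and the two sample rows are copied from the output above (in the notebook you would pass `cleaned` instead).

```python
import re
from collections import Counter

# Matches the name of an opening or closing tag, e.g. "<sup" or "</i"
tag_name = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")

sample = {
    8817: 'Qui positus forte in statione pontis. [<i lang="en">Sentences omitted at this point.</i>]',
    9532: "ne-pi-is-za-as-ta <sup>D</sup>IŠKUR-un-ni a-as-su-us e-es-ta ",
}

def count_tags(texts):
    """Count tag names across a dict of id -> text (opening and closing tags both count)."""
    counts = Counter()
    for text in texts.values():
        counts.update(name.lower() for name in tag_name.findall(text))
    return counts

print(count_tags(sample))
```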

There are a few more obvious cleanups we can do.

```
[<i lang="en">Sentences omitted at this point.</i>]
```
and
```
[<i lang="en">Section omitted at this point.</i>]
```

are easy to catch.

This pattern is trickier, but we can get it:
```
1248b-1278.)

<p>1248b.)
```
and the following `</p>`

And finally, `<div>` and `</div>` can be removed.

Everything else looks like important grammatical markers.

In [21]:
# Raw strings keep "\[" and "\)" from being treated as (invalid) string escapes
omit = re.compile(r"\[<i[^<]+omitted[^>]+>\]")
p = re.compile(r"^[^<]+<p>[^)]+\) ")
close_p = re.compile("</p>")
div = re.compile("</?div>")

# first, make sure these are catching what we want to catch
patterns = {"OMIT": omit, "P": p, "CLOSE_P": close_p, "DIV": div}
for name, pattern in patterns.items():
    print(f"{name}\n---------")
    for id, text in cleaned.items():
        matches = pattern.findall(text)
        if matches:
            print(f"{id} ({matches}): {text}")

OMIT
---------
8817 (['[<i lang="en">Sentences omitted at this point.</i>]']): Qui positus forte in statione pontis. [<i lang="en">Sentences omitted at this point.</i>]
8846 (['[<i lang="en">Section omitted at this point.</i>]']): Atque ob eam causam, qui sunt adfecti gravioribus morbis quique in proeliis periculisque versantur, aut pro victimis homines immolant aut se immolaturos vovent, administrisque ad ea sacrificia druidibus utuntur. [<i lang="en">Section omitted at this point.</i>]
P
---------
11131 (['537-603a.)\n\n<p>537.) ']): 537-603a.)

<p>537.) Thoh thar than gihuilic hêlag man
Krist antkendi, thoh ni uuarð it gio te thes kuniges hoƀe
them mannun gimârid, thea im an iro môdseƀon
holde ni uuârun, ac uuas im so bihalden forð
mid uuordun endi mid uuerkun, antthat thar uueros ôstan
suîðo glauua gumon gangan quâmun
threa te thero thiodu, thegnos snelle,
an langan uueg oƀar that land tharod:
folgodun ênun berhtun bôkne endi sôhtun that barn godes
mid hluttru hugi: uueldun im hnîg
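
The `P` match above spans several lines even though no regex flags are set. That works because a negated character class such as `[^<]` also matches newlines. A minimal sketch on a shortened copy of row 11131 (only its first verse line is kept here):

```python
import re

p = re.compile(r"^[^<]+<p>[^)]+\) ")

# Shortened copy of row 11131: heading, blank line, <p>, line label, first verse line
sample = "537-603a.)\n\n<p>537.) Thoh thar than gihuilic hêlag man"

match = p.match(sample)
print(repr(match.group(0)))   # the heading, the newlines, and the "<p>537.) " label
print(repr(p.sub("", sample)))
```

Everything up to and including the line label is consumed, leaving only the verse text.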

In [22]:
# Perfect. Let's clean them up
cleaned_2 = {}
for id, row in corpus.glossed_text.items():
    text = row.glossed_text
    text = line_number.sub("", text)
    text = br.sub("", text)
    text = omit.sub("", text)
    text = p.sub("", text)
    text = close_p.sub("", text)
    text = div.sub("", text)
    cleaned_2[id] = text
    
# What HTML tags are left?
for id, text in cleaned_2.items():
    if "<" in text:
        print(f"{id}: {text}")

9531: <sup>M</sup>A-ni-it-ta DUMU <sup>M</sup>Pi-it-ha-a-na LUGAL <sup>URU</sup>Ku-us-sa-ra <i>QÍ-BÍ-MA</i> 
9532: ne-pi-is-za-as-ta <sup>D</sup>IŠKUR-un-ni a-as-su-us e-es-ta 
9533: na-as-ta <sup>D</sup>IŠKUR-un-ni-ma ma-a-an a-as-su-us e-es-ta <sup>URU</sup>Ne-e-sa-as LUGAL-us <sup>URU</sup>Ku-us-sa-ra-as LUGAL-i ... 
9534: LUGAL <sup>URU</sup>Ku-us-sa-ra URU-az kat-ta pa-an-ga-ri-it ú-e-et nu <sup>URU</sup>Ne-e-sa-an is-pa-an-di na-ak-ki-it da-a-as 
9535: <sup>URU</sup>Ne-e-sa-as LUGAL-un <i>IṢ-BAT</i> <i>Ù</i> DUMU<sup>MEŠ</sup> <sup>URU</sup>Ne-e-sa-as i-da-a-lu na-at-ta ku-e-da-ni-ik-ki tak-ki-is-ta 
9537: nu <sup>M</sup>Pi-it-ha-a-na-as at-ta-as-ma-as a-ap-pa-an sa-ni-ya ú-et-ti hu-ul-la-an-za-an hu-ul-la-nu-un 
9538: <sup>D</sup>UTU-az ut-ne-e ku-it ku-it-pat a-ra-is nu-us hu-u-ma-an-du-us-pat hu-ul-la-nu-un 
9539: ka-ru-ú <sup>M</sup>U-uh-na-as LUGAL <sup>URU</sup>Za-a-al-pu-wa <sup>D</sup>Si-ú-sum-mi-in <sup>URU</sup>Ne-e-sa-az <sup>URU</sup>Za-a-al-pu-wa pe-e-da-as 
9540: ap

Winner, winner, chicken dinner. This looks good: what remains in the output above is `<sup>` and `<i>` markup, which we want to keep. We'll still need to clean up whitespace issues, but these rules take care of the HTML.
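
For reuse outside the notebook, the rules could be gathered into one function, with a whitespace pass on top. This is a sketch rather than a final implementation: `strip_html` and `tidy_whitespace` are hypothetical names, and the whitespace rule (collapse runs of spaces and tabs, trim each line, keep newlines) is an assumption about what "cleaning up whitespace" should mean for verse texts.

```python
import re

# Same patterns as in the cells above, applied in the same order
LINE_NUMBER = re.compile(r"^<((font[^<]+</font>)|(i[^<]+</i>)) (- )?")
BR = re.compile(r"<br ?/?> ?")
OMIT = re.compile(r"\[<i[^<]+omitted[^>]+>\]")
P = re.compile(r"^[^<]+<p>[^)]+\) ")
CLOSE_P = re.compile(r"</p>")
DIV = re.compile(r"</?div>")

# Assumption: spaces and tabs may be collapsed, newlines are preserved
SPACES = re.compile(r"[ \t]+")

def strip_html(text):
    """Apply the HTML cleanup rules; <sup> and <i> markers are left alone."""
    for pattern in (LINE_NUMBER, BR, OMIT, P, CLOSE_P, DIV):
        text = pattern.sub("", text)
    return text

def tidy_whitespace(text):
    """Collapse runs of spaces/tabs and trim each line."""
    lines = (SPACES.sub(" ", line).strip() for line in text.splitlines())
    return "\n".join(lines).strip()

# Row 8817 as printed earlier in this notebook
raw = 'Qui positus forte in statione pontis. [<i lang="en">Sentences omitted at this point.</i>]'
print(tidy_whitespace(strip_html(raw)))
```

Keeping the two steps separate makes it easy to swap in a different whitespace policy later without touching the HTML rules.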