This notebook is an effort to address this issue: https://github.com/LingResCtr/dead-languages-corpus/issues/2

In [1]:
from pathlib import Path
import sys
sys.path.insert(0, "../scripts")

temp_folder = Path("../intermediate/dead-languages-corpus-20221117")

This notebook assumes you have already extracted the raw data to an intermediate location. If you have not, feel free to uncomment and run the following snippet.

In [2]:
# from convert import extract
# zip_path = Path("../raw/dead-languages-corpus-20221117.zip")
# extract(zip_path=zip_path, temp_folder=temp_folder)

In [4]:
from corpus import Gloss, load_corpus_part

# gloss is a dict (insertion-ordered since Python 3.7) mapping gloss id to a Gloss object
gloss = load_corpus_part(Gloss, temp_folder / "eieol_gloss.csv")

In [5]:
n_gloss = len(gloss)

gloss_list = list(gloss.values())

In [16]:
gloss_context = {}
for i in range(n_gloss):
    # if glossed_text_id is missing (order is also None in the examples printed below)
    if gloss_list[i].glossed_text_id is None:
        # get the glosses just above and below this one
        gloss_context[gloss_list[i].id] = gloss_list[max(0, i - 1):i + 2]

In [20]:
for gc in list(gloss_context.values())[0]:
    print(gc)

Gloss(id=88954, surface_form='távéd', contextual_gloss='your own', comments='The particle <span class="SanskExt" lang="sa">ít</span> stresses the previous word, here \'your\'.', underlying_form='táva ít', language_id=7, glossed_text_id=9140, order=80)
Gloss(id=88955, surface_form='ít', contextual_gloss='own', comments=" Stresses the previous word, here 'your'.", underlying_form=None, language_id=7, glossed_text_id=None, order=None)
Gloss(id=88956, surface_form='uṣo', contextual_gloss='O dawn', comments=None, underlying_form='uṣas', language_id=7, glossed_text_id=9140, order=100)


In [17]:
# how many glosses are missing a glossed_text_id?
print(len(gloss_context))

153


In [23]:
# how often do the glosses directly above and below share the same glossed_text_id?
matching_glossed_texts = {
    gloss_id: glosses for gloss_id, glosses in gloss_context.items()
    if (
        len(glosses) == 3 and 
        glosses[0].glossed_text_id is not None and
        glosses[0].glossed_text_id == glosses[2].glossed_text_id
    )
}
print(len(matching_glossed_texts))

64


In [24]:
# of those 64, how many have a gap of exactly 20 between the first and last order?
easy_order = {
    gloss_id: glosses for gloss_id, glosses in matching_glossed_texts.items()
    if (glosses[2].order - glosses[0].order) == 20
}
print(len(easy_order))

25


In [25]:
# That is lower than I had hoped. What do the ones that don't fit that description look like?
for gloss_id, glosses in matching_glossed_texts.items():
    if gloss_id not in easy_order:
        for g in glosses:
            print(g)
        break

Gloss(id=89068, surface_form='ā́dityāsa', contextual_gloss='Adityas', comments=None, underlying_form='ā́dityāsas', language_id=7, glossed_text_id=9152, order=50)
Gloss(id=89069, surface_form='ámatim', contextual_gloss='lack of thought', comments=None, underlying_form=None, language_id=7, glossed_text_id=None, order=None)
Gloss(id=89070, surface_form='ŕ̥dhag', contextual_gloss='on one side', comments=None, underlying_form='ŕ̥dhak', language_id=7, glossed_text_id=9152, order=80)


In [27]:
# Hmm. The actual word between those two is utā́matim, which can be found in this gloss
gloss[111526]

Gloss(id=111526, surface_form='utā́matim', contextual_gloss='and lack of thought', comments=None, underlying_form='utá ámatim', language_id=7, glossed_text_id=9152, order=60)
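
The manual lookup above could be automated along these lines. This is only a sketch: `GlossStub` and `glosses_between` are hypothetical names introduced for illustration, and the stub mirrors just the fields printed in this notebook rather than the real `corpus.Gloss`. It searches for glosses in the same glossed text whose `order` falls strictly between the two neighbours of an orphaned gloss.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlossStub:
    """Hypothetical stand-in for corpus.Gloss, with only the fields used here."""
    id: int
    surface_form: str
    glossed_text_id: Optional[int] = None
    order: Optional[int] = None


def glosses_between(all_glosses, before, after):
    """Glosses in the same text whose order lies strictly between two neighbours."""
    return [
        g for g in all_glosses
        if g.glossed_text_id == before.glossed_text_id
        and g.order is not None
        and before.order < g.order < after.order
    ]


# Values copied from the output printed above.
before = GlossStub(89068, "ā́dityāsa", glossed_text_id=9152, order=50)
orphan = GlossStub(89069, "ámatim")
after = GlossStub(89070, "ŕ̥dhag", glossed_text_id=9152, order=80)
fixed = GlossStub(111526, "utā́matim", glossed_text_id=9152, order=60)

found = glosses_between([before, orphan, after, fixed], before, after)
print([g.id for g in found])
```

With the real data, `all_glosses` would be `gloss.values()`, and an empty result would mean no replacement gloss sits in that gap.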

In [28]:
# spot checking some others
for gloss_id, glosses in easy_order.items():
    for g in glosses:
        print(g)
    break

Gloss(id=88954, surface_form='távéd', contextual_gloss='your own', comments='The particle <span class="SanskExt" lang="sa">ít</span> stresses the previous word, here \'your\'.', underlying_form='táva ít', language_id=7, glossed_text_id=9140, order=80)
Gloss(id=88955, surface_form='ít', contextual_gloss='own', comments=" Stresses the previous word, here 'your'.", underlying_form=None, language_id=7, glossed_text_id=None, order=None)
Gloss(id=88956, surface_form='uṣo', contextual_gloss='O dawn', comments=None, underlying_form='uṣas', language_id=7, glossed_text_id=9140, order=100)


The orphaned word `ít` does not even appear as a separate word in the glossed text, which only has the combined form `távéd`:

úd usríyāḥ sr̥jate sū́riyaḥ sácām̐ <br/>
udyán nákṣatram arcivát <br/>
távéd uṣo viúṣi sū́riyasya ca <br/>
sám bhakténa gamemahi <br/><br/>

Huzzah! Judging by these spot checks, the orphaned glosses look like errors that have already been corrected elsewhere in the data set. We can just ignore any glosses that have `NULL` for `order` or `glossed_text_id`.
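
A minimal sketch of that filtering step, assuming the same dict-of-glosses shape that `load_corpus_part` returns above. `GlossStub` and `drop_orphan_glosses` are hypothetical names introduced for illustration; the stub only mirrors the fields printed in this notebook and is not the real `corpus.Gloss`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlossStub:
    """Hypothetical stand-in for corpus.Gloss, with only the fields used here."""
    id: int
    surface_form: str
    glossed_text_id: Optional[int] = None
    order: Optional[int] = None


def drop_orphan_glosses(glosses):
    """Return a new dict without glosses that lack an order or a glossed_text_id."""
    return {
        gloss_id: g
        for gloss_id, g in glosses.items()
        if g.order is not None and g.glossed_text_id is not None
    }


# Values copied from the first context printed in this notebook.
sample = {
    88954: GlossStub(88954, "távéd", glossed_text_id=9140, order=80),
    88955: GlossStub(88955, "ít"),
    88956: GlossStub(88956, "uṣo", glossed_text_id=9140, order=100),
}
kept = drop_orphan_glosses(sample)
print(sorted(kept))
```

Building a new dict rather than deleting in place keeps the original `gloss` mapping available for further inspection.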