# Description

In this notebook we employ `latincy` (a spaCy NLP model for Latin; see Burns, P. J. (2023). LatinCy: Synthetic Trained Pipelines for Latin NLP. https://doi.org/10.48550/ARXIV.2305.04365) to produce an automatic, neural-network-driven linguistic annotation of all texts in noscemus. To improve the results, we first clean the texts. We also develop a dedicated approach for large text files, which spaCy cannot process in a single pass.

* INPUT: text files in the "noscemus_raw" subdirectory on sciencedata
* OUTPUT: json files in the "noscemus_spacyjsons_v1" subdirectory on sciencedata (WARNING: large folder, more than 20 GB in total) 

In [1]:
import sddk
import re

# Spacy setup (run only the first time!)

In [15]:
!wget https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl
!mv la_core_web_lg-any-py3-none-any.whl la_core_web_lg-3.7.6-py3-none-any.whl

--2024-07-26 13:18:15--  https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl
Resolving huggingface.co (huggingface.co)... 2600:9000:2127:5e00:17:b174:6d00:93a1, 2600:9000:2127:2e00:17:b174:6d00:93a1, 2600:9000:2127:a000:17:b174:6d00:93a1, ...
Connecting to huggingface.co (huggingface.co)|2600:9000:2127:5e00:17:b174:6d00:93a1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/20/6c/206c141b24bcbc6e087cc8bcea2743f3c78a3e9d01b700a7829b7a2acfbe486e/0baa63d3bbe38414907f3a60ae45c4b99ee455c2e08bde544d01215bdde73c89?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27la_core_web_lg-any-py3-none-any.whl%3B+filename%3D%22la_core_web_lg-any-py3-none-any.whl%22%3B&Expires=1722251895&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMjI1MTg5NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy8yMC82Yy8yMDZjMTQxYjI0YmNiYzZ

In [24]:
!../noscemus_venv/bin/python -m pip install la_core_web_lg-3.7.6-py3-none-any.whl --ignore-installed

Processing ./la_core_web_lg-3.7.6-py3-none-any.whl
Collecting spacy-lookups-data@ git+https://github.com/diyclassics/spacy-lookups-data.git#egg=spacy-lookups-data (from la-core-web-lg==3.7.6)
  Cloning https://github.com/diyclassics/spacy-lookups-data.git to /private/var/folders/57/tg7c_g894t5c2z3swkqzds5h0000gn/T/pip-install-fx8hqulr/spacy-lookups-data_8a3e1fab195d45848c6e9559b48b96d0
  Running command git clone --filter=blob:none --quiet https://github.com/diyclassics/spacy-lookups-data.git /private/var/folders/57/tg7c_g894t5c2z3swkqzds5h0000gn/T/pip-install-fx8hqulr/spacy-lookups-data_8a3e1fab195d45848c6e9559b48b96d0
  Resolved https://github.com/diyclassics/spacy-lookups-data.git to commit 5f2b7e60d3b461cd61649c0bb75f65a242b56ece
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting spacy<3.8.0,>=3.7.4 (from la-core-web-lg==3.7.6)
  Using cached s

In [13]:
!rm  la_core_web_lg-3.7.6-py3-none-any.whl

In [2]:
import spacy
from spacy.tokens import Doc

In [3]:
spacy.require_gpu()

True

In [4]:
spacy.prefer_gpu() # check whether the gpu is on

True

In [5]:
nlp = spacy.load('la_core_web_lg')

In [6]:
nlp.max_length # the maximal number of characters an input document can contain to be processed  

1000000

### Spacy test

In [5]:
# what elements are in the pipeline?
nlp.pipeline

[('normer', <function la_core_web_lg.functions.normer(doc)>),
 ('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x30e5b4290>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x313fdbcb0>),
 ('morphologizer',
  <spacy.pipeline.morphologizer.Morphologizer at 0x313fdbef0>),
 ('trainable_lemmatizer',
  <spacy.pipeline.edit_tree_lemmatizer.EditTreeLemmatizer at 0x313fdb290>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x311bd8740>),
 ('lookup_lemmatizer',
  <function la_core_web_lg.functions.make_lookup_lemmatizer_function(doc)>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x311bd8970>)]

In [6]:
# check the functionality with a famous classical text
vitruvius = "Architecti Augusti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur. Opera ea nascitur et fabrica et ratiocinatione."

In [7]:
doc = nlp(vitruvius)

In [8]:
for token in doc:
    print(f"Token: {token.text}")
    print(f"Lemma: {token.lemma_}")
    print(f"Part of Speech: {token.pos_}")
    print(f"Is Stop Word: {token.is_stop}")
    print(f"Dependency: {token.dep_}")
    print("------")

Token: Architecti
Lemma: architectus
Part of Speech: ADJ
Is Stop Word: False
Dependency: ROOT
------
Token: Augusti
Lemma: Augustus
Part of Speech: PROPN
Is Stop Word: False
Dependency: nmod
------
Token: est
Lemma: sum
Part of Speech: AUX
Is Stop Word: True
Dependency: aux:pass
------
Token: scientia
Lemma: scientia
Part of Speech: NOUN
Is Stop Word: False
Dependency: nsubj:pass
------
Token: pluribus
Lemma: plus
Part of Speech: DET
Is Stop Word: False
Dependency: det
------
Token: disciplinis
Lemma: disciplina
Part of Speech: NOUN
Is Stop Word: False
Dependency: nsubj:pass
------
Token: et
Lemma: et
Part of Speech: CCONJ
Is Stop Word: True
Dependency: cc
------
Token: variis
Lemma: uarius
Part of Speech: ADJ
Is Stop Word: False
Dependency: amod
------
Token: eruditionibus
Lemma: eruditio
Part of Speech: NOUN
Is Stop Word: False
Dependency: conj
------
Token: ornata
Lemma: orno
Part of Speech: VERB
Is Stop Word: False
Dependency: conj
------
Token: ,
Lemma: ,
Part of Speech: PUNCT
Is 

In [9]:
for token in doc:
    print(token.text, token.pos_, token.lemma_)

Architecti ADJ architectus
Augusti PROPN Augustus
est AUX sum
scientia NOUN scientia
pluribus DET plus
disciplinis NOUN disciplina
et CCONJ et
variis ADJ uarius
eruditionibus NOUN eruditio
ornata VERB orno
, PUNCT ,
quae PRON qui
ab ADP ab
ceteris DET ceterus
artibus NOUN ars
perficiuntur VERB perficio
. PUNCT .
Opera NOUN opus
ea PRON is
nascitur VERB nascor
et CCONJ et
fabrica NOUN fabrica
et CCONJ et
ratiocinatione NOUN ratiocinatio
. PUNCT .


In [10]:
all_sents_lemmata = []
for sent in doc.sents:
    sent_lemmata = []
    for token in sent:
        if token.pos_ in ["NOUN", "VERB", "ADJ"]:
            sent_lemmata.append(token.lemma_)
    all_sents_lemmata.append(sent_lemmata)

In [11]:
doc.ents

()

In [12]:
all_sents_lemmata

[['architectus',
  'scientia',
  'disciplina',
  'uarius',
  'eruditio',
  'orno',
  'ars',
  'perficio'],
 ['opus', 'nascor', 'fabrica', 'ratiocinatio']]

# Apply spacy model on noscemus

In [8]:
s =  sddk.cloudSession(public_folder_url="https://sciencedata.dk/public/87394f685b79e7f1ebd4a7ead2b4941c/")

In [97]:
# s = sddk.cloudSession(provider="sciencedata.dk", shared_folder_name="TOME/DATA/NOSCEMUS", owner="kase@zcu.cz")

connection with shared folder established with you as its ordinary user
endpoint variable has been configured to: https://sciencedata.dk/sharingout/kase%40zcu.cz/TOME/DATA/NOSCEMUS/


In [98]:
# load metadata

In [6]:
# extract a list of ids for iteration
filenames_list = s.list_filenames("noscemus_raw", "txt")

In [7]:
filenames_list[:10]

['1031760.txt',
 '1085290.txt',
 '1285853.txt',
 '1285854.txt',
 '1285855.txt',
 '1285856.txt',
 '1365811.txt',
 '1370560.txt',
 '1378359.txt',
 '1424044.txt']

In [101]:
[fn for fn in filenames_list if "_" in fn]

[]

In [102]:
ids = [fn.partition(".")[0] for fn in filenames_list]

In [105]:
jsonfiles_list = s.list_filenames("noscemus_spacyjsons_v1", "json")

In [107]:
jsonfiles_list

['1031760.json',
 '1085290.json',
 '1285853.json',
 '1285854.json',
 '1285855.json',
 '1285856.json',
 '1365811.json',
 '1370560.json',
 '1378359.json',
 '1424044.json',
 '1461594.json',
 '1479057.json',
 '1509197.json',
 '1509290.json',
 '1526071.json',
 '1528734.json',
 '1567826.json',
 '597675.json',
 '597737.json',
 '597799.json',
 '598104.json',
 '598116.json',
 '598518.json',
 '599651.json',
 '599653.json',
 '599722.json',
 '599723.json',
 '599724.json',
 '599725.json',
 '599726.json',
 '599727.json',
 '599728.json',
 '599729.json',
 '604274.json',
 '604275.json',
 '604277.json',
 '604307.json',
 '604308.json',
 '604309.json',
 '604311.json',
 '604889.json',
 '604890.json',
 '604891.json',
 '604892.json',
 '604893.json',
 '604894.json',
 '605034.json',
 '605035.json',
 '605036.json',
 '605037.json',
 '605038.json',
 '605039.json',
 '605239.json',
 '605240.json',
 '605241.json',
 '605285.json',
 '605286.json',
 '605287.json',
 '605288.json',
 '605289.json',
 '605290.json',
 '60529

In [108]:
filenames_list[1]

'1085290.txt'

# Text cleaning

In [109]:
filename = filenames_list[1]
rawtext = s.read_file("noscemus_raw/" + filename, "str")

In [110]:
len(rawtext)

2988478
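
At 2,988,478 characters, this raw text is well above the `nlp.max_length` of 1,000,000 reported earlier, so it cannot be passed to `nlp` in a single call. As a rough back-of-the-envelope check, the minimum number of pieces follows from a ceiling division. This is plain Python, and the helper name `min_pieces` is ours and not part of spaCy:

```python
import math

def min_pieces(n_chars, max_length=1_000_000):
    # smallest number of pieces such that no piece is longer than max_length
    return math.ceil(n_chars / max_length)

print(min_pieces(2988478))
```

In practice more pieces may be needed, since splitting at sentence boundaries rarely fills each piece completely.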

In [23]:
#@Language.component("text_cleaner")
#def text_cleaner(rawtext):
#    for token in doc:
#        token.norm_ = token.norm_.replace("¬\n", "").replace("\n", " ").replace("ß", "ss").replace("ij","ii")
#    return doc

In [24]:
#nlp.add_pipe("text_cleaner", after="normer") 
#nlp.pipeline                                                          

In [112]:
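def text_cleaner(rawtext):
    # rejoin words split at line ends ("¬\n"), turn remaining line breaks into spaces, normalize "ß" and "ij"
    cleantext = rawtext.replace("¬\n", "").replace("\n", " ").replace("ß", "ss").replace("ij","ii")
    # lowercase everything but the first character of each word (e.g. "RENOVATUS" -> "Renovatus")
    cleantext = " ".join([t[0] + t[1:].lower() for t in cleantext.split()])
    # collapse any remaining whitespace runs (raw string, so that "\s" is not read as a string escape)
    cleantext = re.sub(r"\s\s+", " ", cleantext)
    return cleantext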
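
To see what the individual cleaning steps do, it helps to run them on a few words. The sketch below repeats the same steps in a standalone helper (the name `text_cleaner_sketch` and the sample string are ours, assembled from words in the raw text shown in this notebook) so that it can be run without the rest of the notebook:

```python
import re

def text_cleaner_sketch(rawtext):
    # 1) rejoin words hyphenated with "¬" at line ends
    # 2) turn the remaining line breaks into spaces
    # 3) normalize "ß" to "ss" and "ij" to "ii"
    cleantext = rawtext.replace("¬\n", "").replace("\n", " ").replace("ß", "ss").replace("ij", "ii")
    # keep the first character of each word, lowercase the rest ("NOVITER" -> "Noviter")
    cleantext = " ".join(t[0] + t[1:].lower() for t in cleantext.split())
    # collapse any remaining whitespace runs
    return re.sub(r"\s\s+", " ", cleantext)

sample = "Catalogum conti¬\nnet; quo indicatur\nNOVITER PRAETER HAEC filij"
cleaned = text_cleaner_sketch(sample)
print(cleaned)
# Catalogum continet; quo indicatur Noviter Praeter Haec filii
```

Note that the "ij" replacement runs before lowercasing, so an upper-case "IJ" is not converted.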

In [113]:
cleantext = text_cleaner(rawtext)

In [114]:
rawtext[:1000]

'GEORG ABRAHAM MERCKLINI\nLINDENIUS RENOVATUS\nDE\nSCRIPTIS MEDICIS\n\n\nLINDENIUS RENQVATUS,\nVE\nS1\nJOHANNIS ANTONIDAE van der LINDEN\nDE\nSCRIA IAUMEDICIS\nLIBRIDVO.\nUORUM PRIOR, OMNIUM, TAM\nC\nVeterum, quàm Recentiorum, Latino idiomate, typis unquam\nexpressorum Scriptorum Medicorum, consummatissimum Catalogum conti¬\nnet; quo indicatur, quid singuli Authores scripserint: nec non ubi, quâ\nformâ, & quo tempore, omnes eorum Scriptorum Editiones\nexcusae prostent:\nPosterior verò Cynosuram Medicam, sive, Rerum & Mate¬\nriarum Indicem, omnium Titulorum vel Thematum Medicorum potiorum\nCommunia Alphabetico hâcque novâ demum Editione primùm adornato ordine suis\nLglicita comprehendentem exhibet, ut inquirenti, quicquid desideraverit, velut\ndigito, in multiplicem usum, clarissimè monstretur:\nNOVITER PRAETER HAEC ADDITA PLURIMORUM AUTHORUM,\nquotquot nempe habere licuit, Vitae Curriculorum succinctâ\nDescriptione:\nAdscita undique ab exteris Medicis subsidiariâ ope, propriâque ultra\

In [115]:
cleantext[:10000]

'Georg Abraham Mercklini Lindenius Renovatus De Scriptis Medicis Lindenius Renqvatus, Ve S1 Johannis Antonidae van der Linden De Scria Iaumedicis Libridvo. Uorum Prior, Omnium, Tam C Veterum, quàm Recentiorum, Latino idiomate, typis unquam expressorum Scriptorum Medicorum, consummatissimum Catalogum continet; quo indicatur, quid singuli Authores scripserint: nec non ubi, quâ formâ, & quo tempore, omnes eorum Scriptorum Editiones excusae prostent: Posterior verò Cynosuram Medicam, sive, Rerum & Materiarum Indicem, omnium Titulorum vel Thematum Medicorum potiorum Communia Alphabetico hâcque novâ demum Editione primùm adornato ordine suis Lglicita comprehendentem exhibet, ut inquirenti, quicquid desideraverit, velut digito, in multiplicem usum, clarissimè monstretur: Noviter Praeter Haec Addita Plurimorum Authorum, quotquot nempe habere licuit, Vitae Curriculorum succinctâ Descriptione: Adscita undique ab exteris Medicis subsidiariâ ope, propriâque ultra decennium adhibitâ singulari operâ

In [116]:
doc = nlp(cleantext[:10000])
doc

Georg Abraham Mercklini Lindenius Renovatus De Scriptis Medicis Lindenius Renqvatus, Ve S1 Johannis Antonidae van der Linden De Scria Iaumedicis Libridvo. Uorum Prior, Omnium, Tam C Veterum, quàm Recentiorum, Latino idiomate, typis unquam expressorum Scriptorum Medicorum, consummatissimum Catalogum continet; quo indicatur, quid singuli Authores scripserint: nec non ubi, quâ formâ, & quo tempore, omnes eorum Scriptorum Editiones excusae prostent: Posterior verò Cynosuram Medicam, sive, Rerum & Materiarum Indicem, omnium Titulorum vel Thematum Medicorum potiorum Communia Alphabetico hâcque novâ demum Editione primùm adornato ordine suis Lglicita comprehendentem exhibet, ut inquirenti, quicquid desideraverit, velut digito, in multiplicem usum, clarissimè monstretur: Noviter Praeter Haec Addita Plurimorum Authorum, quotquot nempe habere licuit, Vitae Curriculorum succinctâ Descriptione: Adscita undique ab exteris Medicis subsidiariâ ope, propriâque ultra decennium adhibitâ singulari operâ 

In [120]:
for token in doc:
    print(token.lemma_, token.pos_)

Georg PROPN
Abraham PROPN
Mercklinus PROPN
Lindenius PROPN
Renovatus PROPN
De ADP
Scriptis ADJ
Medica NOUN
Lindenius PROPN
Renqvatus PROPN
, PUNCT
Ue PRON
S1 ADV
Johann PROPN
Antonidae PROPN
van PROPN
der ADP
Linden 
de ADP
Scria PROPN
Iaumedicus NOUN
Libridvo PROPN
. PUNCT
Uorum NOUN
Prior ADJ
, PUNCT
omnium ADJ
, PUNCT
tam ADV
c PROPN
Ueterum ADJ
, PUNCT
quàm NOUN
Recentiorum ADJ
, PUNCT
Latinus PROPN
idioma NOUN
, PUNCT
typum NOUN
umquam ADV
expressus VERB
scriptorum NOUN
Medicus NOUN
, PUNCT
consummatis ADJ
Catalogus PROPN
contineo VERB
; PUNCT
qui PRON
indico VERB
, PUNCT
quis PRON
singulus ADJ
Author NOUN
scripserint VERB
: PUNCT
nec CCONJ
non PART
ubi ADV
, PUNCT
quâ DET
formâ NOUN
, PUNCT
& PUNCT
qui PRON
tempus NOUN
, PUNCT
omnis ADJ
is PRON
Scripti NOUN
Editiones NOUN
excusae VERB
prosto VERB
: PUNCT
posterior ADJ
uerò NOUN
Cynosura PROPN
Medicus ADJ
, PUNCT
siue CCONJ
, PUNCT
Re NOUN
& PUNCT
Materia NOUN
Indix NOUN
, PUNCT
omnis ADJ
Tituli NOUN
uel CCONJ
Themat PROPN
Medicus

# Working with large files - development

In [30]:
cleantext = cleantext[:380000]

In [31]:
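# split the text into segments ending at a sentence boundary (". ") and parse them one by one
segment_docs = []
segment_len = 100000
if len(cleantext) > segment_len:
    parts = cleantext[:segment_len].rpartition(". ")
    current_segment = parts[0] + parts[1]
    segment_doc = nlp(current_segment)
    segment_docs.append(segment_doc)
    next_segment_beginning = parts[2] # unfinished sentence carried over to the next segment
    for n in range(segment_len, len(cleantext), segment_len):
        print(n)
        segment = cleantext[n:n+segment_len]
        if len(segment) == segment_len: 
            parts = segment.rpartition(". ")
            current_segment = parts[0] + parts[1]
            segment_doc = nlp(next_segment_beginning + current_segment)
            next_segment_beginning = parts[2]
        else:
            # last (shorter) segment: prepend the carried-over text so that it is not dropped
            segment_doc = nlp(next_segment_beginning + segment)
            next_segment_beginning = ""
        segment_docs.append(segment_doc)
    if next_segment_beginning:
        # text length is an exact multiple of segment_len: parse the remaining tail as well
        segment_docs.append(nlp(next_segment_beginning))
    doc = Doc.from_docs(segment_docs)
else:
    doc = nlp(cleantext)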

100000
200000
300000


In [32]:
doc = Doc.from_docs(segment_docs)

In [33]:
cleantext[199900:200100]

'nica Roberti Boylei, de Vi Aeris elastico, & ejusdem effectibus; quibus Observata illius rationibus Philosophicis, omni Vacuum, ipsumque elaterem Aeris Pecquetianum arcentibus, illustrantur. Gröningae'

In [34]:
doc.text[199900:200100]

'nica Roberti Boylei, de Vi Aeris elastico, & ejusdem effectibus; quibus Observata illius rationibus Philosophicis, omni Vacuum, ipsumque elaterem Aeris Pecquetianum arcentibus, illustrantur. Gröningae'

In [36]:
# let's encapsulate the cleaning and spacy pipeline application into one function
def from_rawtext_to_doc(rawtext):
    cleantext = text_cleaner(rawtext)
    segment_len = 800000 # below nlp.max_length (1000000)
    if len(cleantext) > segment_len:
        segment_docs = []
        parts = cleantext[:segment_len].rpartition(". ")
        current_segment = parts[0] + parts[1]
        segment_doc = nlp(current_segment)
        segment_docs.append(segment_doc)
        next_segment_beginning = parts[2] # unfinished sentence carried over to the next segment
        for n in range(segment_len, len(cleantext), segment_len):
            segment = cleantext[n:n+segment_len]
            if len(segment) == segment_len:
                parts = segment.rpartition(". ")
                current_segment = parts[0] + parts[1]
                segment_doc = nlp(next_segment_beginning + current_segment)
                next_segment_beginning = parts[2]
            else:
                # last (shorter) segment: prepend the carried-over text so that it is not dropped
                segment_doc = nlp(next_segment_beginning + segment)
                next_segment_beginning = ""
            segment_docs.append(segment_doc)
        if next_segment_beginning:
            # text length is an exact multiple of segment_len: parse the remaining tail as well
            segment_docs.append(nlp(next_segment_beginning))
        doc = Doc.from_docs(segment_docs)
    else:
        doc = nlp(cleantext)
    return doc
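
The splitting logic can be studied in isolation from spaCy. The following is a minimal pure-Python sketch, not the function above: the generator name `split_at_sentence_ends` is hypothetical, and it assumes that ". " is an acceptable sentence boundary. Each yielded piece would then be passed to `nlp` and the resulting docs merged with `Doc.from_docs`. The final check confirms that concatenating the pieces gives back the input, i.e. that no text is lost at the segment borders.

```python
def split_at_sentence_ends(text, segment_len):
    """Yield consecutive pieces of `text`; every piece but the last ends with '. '."""
    carry = ""  # unfinished sentence left over from the previous window
    for start in range(0, len(text), segment_len):
        window = carry + text[start:start + segment_len]
        is_last = start + segment_len >= len(text)
        if is_last:
            # final window: emit everything, including the carried-over text
            yield window
            carry = ""
        else:
            head, sep, tail = window.rpartition(". ")
            if sep:
                yield head + sep
                carry = tail
            else:
                # no sentence boundary found in this window: keep accumulating
                carry = window

text = "Prima sententia. Secunda sententia longior est. Tertia. Quarta et ultima"
segments = list(split_at_sentence_ends(text, 20))
print(segments)
print("".join(segments) == text)
```

A piece can be longer than `segment_len` by the length of the carried-over text, so in this sketch it is safer to keep `segment_len` well below `nlp.max_length`.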

# Applying the function

In [37]:
# input text files
filenames_list = s.list_filenames("noscemus_raw", "txt")

In [54]:
len(filenames_list)

1007

In [43]:
# output jsonfiles
target_folder_name = "noscemus_spacyjsons_v1"
jsonfiles_list = s.list_filenames(target_folder_name, "json")
jsonfiles_list[:10]

['1031760.json',
 '1085290.json',
 '1285853.json',
 '1285854.json',
 '1285855.json',
 '1285856.json',
 '1365811.json',
 '1370560.json',
 '1378359.json',
 '1424044.json']

In [44]:
len(jsonfiles_list)

842

In [40]:
%%time
%%capture
for n, filename in enumerate(filenames_list):
    if n % 50 == 0:
        print(n)
    try:
        new_filename = filename.partition(".")[0] + ".json"
        if new_filename not in jsonfiles_list: # skip files already generated in a previous run of the script
            rawtext = s.read_file("noscemus_raw/" + filename, "str")
            doc = from_rawtext_to_doc(rawtext)
            doc_json = doc.to_json()
            s.write_file(target_folder_name + "/" + new_filename, doc_json)
    except Exception: # skip files that fail; see the "Dealing with missing files" section
        pass

CPU times: user 9h 13min 45s, sys: 20min 57s, total: 9h 34min 42s
Wall time: 10h 48min 17s
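
Because the loop above silently skips any file that raises an exception, it can be helpful to record which files failed and why. The following is a minimal sketch under stated assumptions: `process_all` and `fake_process` are hypothetical names, and `fake_process` merely stands in for the read, annotate and write steps.

```python
def process_all(filenames, process_one, already_done=()):
    """Run `process_one` on every file without an output yet; return the failures."""
    done = set(already_done)
    failures = {}
    for filename in filenames:
        new_filename = filename.partition(".")[0] + ".json"
        if new_filename in done:
            continue  # output already exists from a previous run
        try:
            process_one(filename)
        except Exception as e:
            # remember the error instead of discarding it
            failures[filename] = repr(e)
    return failures

def fake_process(filename):
    # stand-in for reading, annotating and writing one file
    if filename == "2.txt":
        raise ValueError("text too long")

failures = process_all(["1.txt", "2.txt", "3.txt"], fake_process, already_done=["3.json"])
print(failures)
```

The returned dict can be inspected after the run, which also works when the cell output itself is captured.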


In [45]:
len(jsonfiles_list)

842

# Dealing with missing files

In [57]:
# the json files we should have (one per input text file)
all_ids_jsons = [fn.rpartition(".")[0] + ".json" for fn in filenames_list]

In [69]:
# output jsonfiles
target_folder_name = "noscemus_spacyjsons_v1"
jsonfiles_list = s.list_filenames(target_folder_name, "json")
jsonfiles_list[:10]

['1031760.json',
 '1085290.json',
 '1285853.json',
 '1285854.json',
 '1285855.json',
 '1285856.json',
 '1365811.json',
 '1370560.json',
 '1378359.json',
 '1424044.json']

In [70]:
missing_jsons = [fn for fn in all_ids_jsons if fn not in jsonfiles_list]
len(missing_jsons)

164
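
The same comparison can be written with a set, which keeps each membership test constant-time regardless of how many files there are. This is a small self-contained sketch: the function names `expected_json_name` and `find_missing` are hypothetical, and the file names are examples taken from this notebook.

```python
def expected_json_name(txt_filename):
    # "862097.txt" -> "862097.json"
    return txt_filename.rpartition(".")[0] + ".json"

def find_missing(txt_filenames, json_filenames):
    existing = set(json_filenames)
    # keep the input order so that the result is reproducible
    return [name for name in map(expected_json_name, txt_filenames) if name not in existing]

missing = find_missing(["1031760.txt", "862097.txt"], ["1031760.json"])
print(missing)
# ['862097.json']
```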

In [71]:
missing_jsons[0]

'862097.json'

In [72]:
rawtext = s.read_file("noscemus_raw/" + missing_jsons[0].rpartition(".")[0] + ".txt", "str")
#doc = from_rawtext_to_doc(rawtext)

In [73]:
len(rawtext)

174954

In [74]:
doc = from_rawtext_to_doc(rawtext)

In [75]:
len(doc.text)

172298

In [76]:
doc_json = doc.to_json()

In [77]:
s.write_file(target_folder_name + "/" + missing_jsons[0], doc_json)

Your <class 'dict'> object has been succesfully written as "https://sciencedata.dk/sharingout/kase%40zcu.cz/TOME/DATA/NOSCEMUS/noscemus_spacyjsons_v1/862097.json"
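
For later use of the stored files, it is worth knowing roughly what they contain. To our understanding, the dict returned by `Doc.to_json()` holds the full `"text"` and a `"tokens"` list whose entries carry character offsets (`"start"`, `"end"`) and, when the pipeline provides them, annotations such as `"pos"` and `"lemma"`. The sketch below works on a tiny hand-written stand-in rather than a real file; the helper names are hypothetical, and the exact set of keys should be checked against an actual output.

```python
# hand-written stand-in for one stored json (values follow the Vitruvius test above)
doc_json = {
    "text": "Opera ea nascitur.",
    "tokens": [
        {"id": 0, "start": 0, "end": 5, "pos": "NOUN", "lemma": "opus"},
        {"id": 1, "start": 6, "end": 8, "pos": "PRON", "lemma": "is"},
        {"id": 2, "start": 9, "end": 17, "pos": "VERB", "lemma": "nascor"},
        {"id": 3, "start": 17, "end": 18, "pos": "PUNCT", "lemma": "."},
    ],
}

def token_texts(doc_json):
    # recover each token's surface form from its character offsets
    return [doc_json["text"][t["start"]:t["end"]] for t in doc_json["tokens"]]

def content_lemmata(doc_json, keep_pos=("NOUN", "VERB", "ADJ")):
    # lemmata of content words only, mirroring the filter used in the spacy test
    return [t["lemma"] for t in doc_json["tokens"] if t.get("pos") in keep_pos]

print(token_texts(doc_json))
print(content_lemmata(doc_json))
```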


In [78]:
%%time
%%capture
for n, filename in enumerate(filenames_list):
    try:
        new_filename = filename.partition(".")[0] + ".json"
        if new_filename not in jsonfiles_list: # process only the files that are still missing
            rawtext = s.read_file("noscemus_raw/" + filename, "str")
            doc = from_rawtext_to_doc(rawtext)
            doc_json = doc.to_json()
            s.write_file(target_folder_name + "/" + new_filename, doc_json)
    except Exception: # skip files that fail again
        pass

CPU times: user 2h 17min 24s, sys: 5min 13s, total: 2h 22min 38s
Wall time: 2h 37min 43s
