# Description

In this notebook we employ `latincy` (a spaCy NLP pipeline for Latin; see Burns, P. J. (2023). LatinCy: Synthetic Trained Pipelines for Latin NLP. https://doi.org/10.48550/ARXIV.2305.04365) to produce an automatic, neural-network-driven linguistic annotation of all texts in Noscemus. To improve the results, we first clean the texts. We also need a dedicated approach for large text files, which spaCy cannot process in one pass.

* INPUT: text files in the "noscemus_raw" subdirectory
* OUTPUT: json files in the "noscemus_spacyjsons_v1" subdirectory (WARNING: large folder, more than 20 GB in total) 

# spaCy setup (run only the first time!)

In [1]:
#!wget https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl
#!mv la_core_web_lg-any-py3-none-any.whl la_core_web_lg-3.7.6-py3-none-any.whl

In [2]:
#!/srv/venvs/latin_venv/bin/python -m pip install la_core_web_lg-3.7.6-py3-none-any.whl --ignore-installed

In [3]:
#!rm  la_core_web_lg-3.7.6-py3-none-any.whl

In [None]:
print("hello")

# Start here

In [None]:
import spacy
import os
import glob
from spacy.tokens import Doc
from spacy.language import Language
import pickle
from unidecode import unidecode
import sddk
import pandas as pd
import re
import cupy
import json

In [7]:
try:
    # Check if GPU is available
    spacy.require_gpu()

    # Verify CuPy can initialize the GPU
    cupy_array = cupy.zeros((10, 10))
    print("CuPy is able to use the GPU.")

    print("GPU is available for SpaCy.")

except ValueError as e:
    print(f"Unable to use GPU for SpaCy: {e}")
except Exception as ex:
    print(f"An error occurred: {ex}")


CuPy is able to use the GPU.
GPU is available for SpaCy.


In [5]:
spacy.require_gpu() 

True

In [6]:
nlp = spacy.load('la_core_web_lg')

In [8]:
nlp.max_length  # maximum number of characters an input text may contain to be processed in one call

1000000

### Spacy test

In [9]:
# what elements are in the pipeline?
nlp.pipeline

[('senter', <spacy.pipeline.senter.SentenceRecognizer at 0x72bb27145d80>),
 ('normer', <function la_core_web_lg.functions.normer(doc)>),
 ('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x72bb27145180>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x72bb27145e40>),
 ('morphologizer',
  <spacy.pipeline.morphologizer.Morphologizer at 0x72bb27145ea0>),
 ('trainable_lemmatizer',
  <spacy.pipeline.edit_tree_lemmatizer.EditTreeLemmatizer at 0x72bb27145b40>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x72bb24bbc890>),
 ('lookup_lemmatizer',
  <function la_core_web_lg.functions.make_lookup_lemmatizer_function(doc)>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x72bb24bbc7b0>)]

In [7]:
# check the functionality with a famous classical text
vitruvius = "Architecti Augusti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur. Opera ea nascitur et fabrica et ratiocinatione."

In [12]:
sent = "Architecti Augusti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur. Opera ea nascitur et fabrica et ratiocinatione (Aristot.)."
doc = nlp(sent)

In [13]:
for t in doc:
    print(t.text, t.lemma_)

Architecti architectus
Augusti Augustus
est sum
scientia scientia
pluribus plus
disciplinis disciplina
et et
variis uarius
eruditionibus eruditio
ornata orno
, ,
quae qui
ab ab
ceteris ceterus
artibus ars
perficiuntur perficio
. .
Opera opera
ea is
nascitur nascor
et et
fabrica fabrica
et et
ratiocinatione ratiocinatio
( (
Aristot Aristot
. .
) )
. .


In [12]:
for token in doc:
    print(f"Token: {token.text}")
    print(f"Lemma: {token.lemma_}")
    print(f"Part of Speech: {token.pos_}")
    print(f"Is Stop Word: {token.is_stop}")
    print(f"Dependency: {token.dep_}")
    print("------")

Token: Architecti
Lemma: architectus
Part of Speech: VERB
Is Stop Word: False
Dependency: ROOT
------
Token: Augusti
Lemma: Augustus
Part of Speech: PROPN
Is Stop Word: False
Dependency: flat:name
------
Token: est
Lemma: sum
Part of Speech: AUX
Is Stop Word: True
Dependency: cop
------
Token: scientia
Lemma: scientia
Part of Speech: NOUN
Is Stop Word: False
Dependency: nsubj:pass
------
Token: pluribus
Lemma: plus
Part of Speech: DET
Is Stop Word: False
Dependency: det
------
Token: disciplinis
Lemma: disciplina
Part of Speech: NOUN
Is Stop Word: False
Dependency: obl
------
Token: et
Lemma: et
Part of Speech: CCONJ
Is Stop Word: True
Dependency: cc
------
Token: variis
Lemma: uarius
Part of Speech: ADJ
Is Stop Word: False
Dependency: amod
------
Token: eruditionibus
Lemma: eruditio
Part of Speech: NOUN
Is Stop Word: False
Dependency: obl
------
Token: ornata
Lemma: orno
Part of Speech: VERB
Is Stop Word: False
Dependency: conj
------
Token: ,
Lemma: ,
Part of Speech: PUNCT
Is Stop Wo

In [13]:
for token in doc:
    print(token.text, token.pos_, token.lemma_)

Architecti VERB architectus
Augusti PROPN Augustus
est AUX sum
scientia NOUN scientia
pluribus DET plus
disciplinis NOUN disciplina
et CCONJ et
variis ADJ uarius
eruditionibus NOUN eruditio
ornata VERB orno
, PUNCT ,
quae PRON qui
ab ADP ab
ceteris DET ceterus
artibus NOUN ars
perficiuntur VERB perficio
. PUNCT .
Opera NOUN opera
ea PRON is
nascitur VERB nascor
et CCONJ et
fabrica NOUN fabrica
et CCONJ et
ratiocinatione NOUN ratiocinatio
. PUNCT .


In [14]:
all_sents_lemmata = []
for sent in doc.sents:
    sent_lemmata = []
    for token in sent:
        if token.pos_ in ["NOUN", "VERB", "ADJ"]:
            sent_lemmata.append(token.lemma_)
    all_sents_lemmata.append(sent_lemmata)

In [15]:
doc.ents

(Augusti,)

In [16]:
all_sents_lemmata

[['architectus',
  'scientia',
  'disciplina',
  'uarius',
  'eruditio',
  'orno',
  'ars',
  'perficio'],
 ['opera', 'nascor', 'fabrica', 'ratiocinatio']]

# Apply the spaCy model to noscemus

In [17]:
# s =  sddk.cloudSession(public_folder_url="https://sciencedata.dk/public/87394f685b79e7f1ebd4a7ead2b4941c/")

In [18]:
filenames_list = os.listdir("/srv/data/tome/noscemus/noscemus_raw")
filenames_list[:10]

['913059.txt',
 '801742.txt',
 '888136.txt',
 '720097.txt',
 '663952.txt',
 '694621.txt',
 '868572.txt',
 '664562.txt',
 '747384.txt',
 '795562.txt']

In [19]:
ids = [fn.partition(".")[0] for fn in filenames_list]

In [22]:
filenames_list[1]

'801742.txt'
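The IDs above are obtained by cutting each file name at its first dot. As a minimal alternative sketch, assuming file names of the form `<id>.txt` as in the listing, `pathlib.Path.stem` strips only the final extension, which is safer if a name ever contains more than one dot. The helper name `ids_from_filenames` is hypothetical.

```python
from pathlib import Path

def ids_from_filenames(filenames):
    """Return the file names without their final extension (hypothetical helper)."""
    return [Path(fn).stem for fn in filenames]

sample = ["913059.txt", "801742.txt", "888136.txt"]  # names taken from the listing above
print(ids_from_filenames(sample))
```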

# Text cleaning

In [24]:
source_path = "/srv/data/tome/noscemus/noscemus_raw/"

In [25]:
filename = filenames_list[20]
with open(source_path + filename, "r", encoding="utf-8") as f:
    rawtext = f.read()

In [26]:
len(rawtext)

456272

In [28]:
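def text_cleaner(rawtext, lowercase=True):
    # rejoin words hyphenated across lines, flatten line breaks, normalise "ß" and "ij"
    cleantext = rawtext.replace("¬\n", "").replace("\n", " ").replace("ß", "ss").replace("ij","ii")
    # keep the first character of each word and lowercase the rest (tames ALL-CAPS headings)
    cleantext = " ".join([t[0] + t[1:].lower() for t in cleantext.split()])
    cleantext = re.sub(r"\s\s+", " ", cleantext)  # raw string avoids an invalid-escape warning
    cleantext = unidecode(cleantext)  # transliterate remaining non-ASCII characters
    cleantext = cleantext.replace("v", "u").replace("V", "U")  # normalise v/u spelling
    if lowercase:
        cleantext = cleantext.lower()
    return cleantext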

# Encapsulate the cleaning and the spaCy pipeline application in one function.
# Texts longer than segment_len are cut at sentence boundaries (". ") and processed piecewise.
def from_rawtext_to_doc(rawtext, lowertext=True):
    cleantext = text_cleaner(rawtext, lowertext)
    segment_len = 800000  # stays below nlp.max_length (1,000,000 characters)
    if len(cleantext) <= segment_len:
        return nlp(cleantext)
    segment_docs = []
    carry = ""  # unfinished sentence carried over from the previous segment
    for n in range(0, len(cleantext), segment_len):
        segment = carry + cleantext[n:n + segment_len]
        if n + segment_len < len(cleantext):
            # not the last segment: cut after the last sentence boundary
            head, sep, carry = segment.rpartition(". ")
            segment = head + sep
        else:
            # last segment: keep the carried-over text instead of dropping it
            carry = ""
        if segment:
            segment_docs.append(nlp(segment))
    return Doc.from_docs(segment_docs)
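The splitting logic in `from_rawtext_to_doc` can be tried out without loading the model. The sketch below is a simplified, model-free illustration, assuming that `". "` marks a sentence boundary; `split_at_sentence_ends` is a hypothetical helper, not part of spaCy or LatinCy. It cuts a long string into chunks that end at a sentence boundary and carries any unfinished sentence over to the next chunk, so a chunk may be somewhat longer than `segment_len`.

```python
def split_at_sentence_ends(text, segment_len):
    """Split text into chunks that end at '. ' where possible (hypothetical helper)."""
    chunks = []
    carry = ""  # unfinished sentence carried over from the previous window
    for start in range(0, len(text), segment_len):
        window = carry + text[start:start + segment_len]
        is_last = start + segment_len >= len(text)
        if is_last:
            chunk, carry = window, ""
        else:
            head, sep, carry = window.rpartition(". ")
            chunk = head + sep
        if chunk:
            chunks.append(chunk)
    return chunks

text = "Prima sententia. Secunda sententia longior est. Tertia."
chunks = split_at_sentence_ends(text, segment_len=20)
for chunk in chunks:
    print(repr(chunk))

# No character is lost: the chunks concatenate back to the input.
assert "".join(chunks) == text
```

In the notebook itself each chunk would be passed to `nlp` and the resulting docs merged with `Doc.from_docs`.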

In [29]:
cleantext = text_cleaner(rawtext[:1000])
cleantext

'epistolarum ab eruditis uiris ad alb. hallerum scriparum pars i. latinae. uol. u. epistolae 133. ad 277. scriptae ab anno mdcclxi. ad annum mdcclxuiii. bernae. sumptibus societatis typographicae. 1774. lecturis s. hallerus. ad hoc uolumen idem semper est quod moneam, nimia nempe repeti encomia, quae delere neglexi morbo oppressus, & temporis penuria. epistolas etiam alias aliis majoris esse momenti fateor, & nonnullas illustrium alioquin uirorum, paulo minus propriis * 2 adno iu adnotationibus diuites. in aliis erat, quae, dum legeret doleret aliquis, ea, quantum potui, deleui. anno 1764. rupe discessi, & bernam repetii. unicum quod superest uolumen latinarum epistolarum finem faciet. d. d. 13. jun. 1774. epi epistolarum tabula gottlieb emanuel haller. ep. 133. 134. 136. 137. 138. 143. 145. 150. 151. 159. werner la chenal. ep. 135. 140. 147. 157. 203. 238. 248. 250. 251. 258. 260. 263. 267. 269. 272. 277. em. berdot, fil. ep. 139. 161. 193. 238. j. b morgagnus + ep.'

In [30]:
rawtext[:1000]

'EPISTOLARUM\nAB\nERUDITIS VIRIS\nAD\nALB. HALLERUM\nSCRIPARUM\nPARS I.\nLatinae.\nVOL. V.\nEPISTOLAE 133. ad 277.\nSCRIPTAE AB ANNO MDCCLXI.\nAD ANNUM MDCCLXVIII.\nBERNAE.\nSumptibus Societatis Typographicae.\n1774.\n\n\nLECTURIS S.\nHALLERUS.\nAd hoc volumen idem semper est\nquod moneam, nimia nempe repeti\nencomia, quae delere neglexi morbo op¬\npressus, & temporis penuria. Episto¬\nlas etiam alias aliis majoris esse mo¬\nmenti fateor, & nonnullas illustrium\nalioquin virorum, paulo minus propriis\n* 2\nadno¬\n\n\nIV\n\n\nadnotationibus divites. In aliis erat,\nquae, dum legeret doleret aliquis, ea,\nquantum potui, delevi. Anno 1764. Ru¬\npe discessi, & Bernam repetii. Unicum\nquod superest volumen latinarum epi¬\nstolarum finem faciet. D. d. 13. Jun.\n1774.\nEPI¬\n\n\nEPISTOLARUM\nTABULA\nGOTTLIEB EMANUEL HALLER. Ep. 133. 134.\n136. 137. 138. 143. 145. 150. 151. 159.\nWERNER LA CHENAL. Ep. 135. 140. 147. 157.\n203. 238. 248. 250. 251. 258. 260. 263. 267.\n269. 272. 277.\nEM. BERDOT
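To see what each cleaning step contributes, the sketch below re-implements the steps of `text_cleaner` on a short snippet assembled from the raw text above. It is an illustration only: to stay dependency-free it omits the `unidecode` transliteration step, so it matches `text_cleaner` only for input that is ASCII apart from the `¬` and `ß` characters it handles. The name `clean_ascii_sample` is hypothetical.

```python
import re

def clean_ascii_sample(rawtext, lowercase=True):
    """Simplified stand-in for text_cleaner without the unidecode step (hypothetical)."""
    text = rawtext.replace("¬\n", "")   # rejoin words hyphenated across lines
    text = text.replace("\n", " ")       # remaining line breaks become spaces
    text = text.replace("ß", "ss").replace("ij", "ii")
    # keep the first character of each word, lowercase the rest (tames ALL-CAPS words)
    text = " ".join(t[0] + t[1:].lower() for t in text.split())
    text = re.sub(r"\s\s+", " ", text)
    text = text.replace("v", "u").replace("V", "U")  # normalise v/u spelling
    return text.lower() if lowercase else text

raw = "quae delere neglexi morbo op¬\npressus, & temporis penuria. ERUDITIS VIRIS"
print(clean_ascii_sample(raw))
print(clean_ascii_sample(raw, lowercase=False))
```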

In [31]:
doc = nlp(cleantext[:10000])
doc

epistolarum ab eruditis uiris ad alb. hallerum scriparum pars i. latinae. uol. u. epistolae 133. ad 277. scriptae ab anno mdcclxi. ad annum mdcclxuiii. bernae. sumptibus societatis typographicae. 1774. lecturis s. hallerus. ad hoc uolumen idem semper est quod moneam, nimia nempe repeti encomia, quae delere neglexi morbo oppressus, & temporis penuria. epistolas etiam alias aliis majoris esse momenti fateor, & nonnullas illustrium alioquin uirorum, paulo minus propriis * 2 adno iu adnotationibus diuites. in aliis erat, quae, dum legeret doleret aliquis, ea, quantum potui, deleui. anno 1764. rupe discessi, & bernam repetii. unicum quod superest uolumen latinarum epistolarum finem faciet. d. d. 13. jun. 1774. epi epistolarum tabula gottlieb emanuel haller. ep. 133. 134. 136. 137. 138. 143. 145. 150. 151. 159. werner la chenal. ep. 135. 140. 147. 157. 203. 238. 248. 250. 251. 258. 260. 263. 267. 269. 272. 277. em. berdot, fil. ep. 139. 161. 193. 238. j. b morgagnus + ep.

In [32]:
for token in doc:
    print(token.lemma_, token.pos_)

epistola NOUN
ab ADP
erudio VERB
uir NOUN
ad ADP
alb ADV
. PUNCT
halle NOUN
scripae NOUN
pars NOUN
i. PROPN
latinus ADJ
. PUNCT
 ADV
 ADJ
epistola NOUN
133 NUM
. PUNCT
ad ADP
277 NUM
. PUNCT
scripno VERB
ab ADP
annus NOUN
mdcclxius VERB
. PUNCT
ad ADP
annus NOUN
 ADJ
. PUNCT
bernus ADJ
. PUNCT
sumptus NOUN
societas NOUN
typographicus ADJ
. PUNCT
1774 X
. PUNCT
lego NOUN
s. ADJ
allerus NOUN
. PUNCT
ad ADP
hic DET
uolumen NOUN
idem DET
semper ADV
sum AUX
quod SCONJ
moneo VERB
, PUNCT
nimius ADJ
nempe ADV
repeto VERB
encomium NOUN
, PUNCT
qui PRON
deleo VERB
neglego VERB
morbus NOUN
opprimo VERB
, PUNCT
& PUNCT
tempus NOUN
penuria NOUN
. PUNCT
epistola NOUN
etiam ADV
alius DET
alius DET
maior DET
sum NOUN
momentum NOUN
fateor VERB
, PUNCT
& PUNCT
nonnullus ADJ
illustris ADJ
alioqui ADV
uir NOUN
, PUNCT
paulus ADV
paruus ADV
proprius ADJ
* PUNCT
2 NUM
adno NOUN
iu NUM
adnotatio NOUN
diues ADJ
. PUNCT
in ADP
alius DET
sum AUX
, PUNCT
qui PRON
, PUNCT
dum SCONJ
lego VERB
doleo VERB
aliquis P

# Applying the function

In [33]:
len(filenames_list)

1007

In [10]:
target_path = "/srv/data/tome/noscemus/sents_data/"
os.makedirs(target_path, exist_ok=True)  # create the output folder if it does not exist yet

In [34]:
os.listdir("/srv/data/tome/noscemus/")

['noscemus_raw',
 'NOSCEMUS FULL',
 'lemmatized_sents',
 'sents_data_lower',
 'test2',
 'test',
 'NOSCEMUS_FULL.zip',
 'sents_data']

In [77]:
%%time
for n, filename in enumerate(filenames_list):
    doc_id = filename.rpartition(".txt")[0]  # avoid shadowing the built-in id()
    with open(source_path + filename, "r", encoding="utf-8") as f:
        rawtext = f.read()
    # keep the original capitalisation here; the lowercased variant is written to sents_data_lower below
    doc = from_rawtext_to_doc(rawtext, lowertext=False)
    # per sentence: (sentence text, [(token, lemma, POS, (start, end) offsets within the sentence)])
    doc_sentdata = [(sent.text, [(t.text, t.lemma_, t.pos_, (t.idx - sent[0].idx, t.idx - sent[0].idx + len(t))) for t in sent]) for sent in doc.sents]
    with open(target_path + doc_id + ".pickle", "wb") as f:
        pickle.dump(doc_sentdata, f)
    if n % 50 == 0:  # print the progress every 50 files
        print(n)

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
CPU times: user 2h 21min 42s, sys: 7min 7s, total: 2h 28min 49s
Wall time: 2h 28min 45s


In [78]:
os.listdir(target_path)[:10]

['725075.pickle',
 '928138.pickle',
 '985903.pickle',
 '733505.pickle',
 '739101.pickle',
 '702145.pickle',
 '906214.pickle',
 '902259.pickle',
 '901017.pickle',
 '904418.pickle']

In [79]:
len(os.listdir(target_path))

1007

In [84]:
# test: loading back individual file
with open(target_path + filenames_list[20].replace(".txt", ".pickle"), "rb") as f:
    sents_data = pickle.load(f)
sents_data[100:110]

[('Tarditis uerum est ad gratissimas tuas litteras respondissem, nisi nouum a Te beneficium in me redundasset.',
  [('Tarditis', 'Tarditus', 'PROPN', (0, 8)),
   ('uerum', 'uerus', 'ADJ', (9, 14)),
   ('est', 'sum', 'AUX', (15, 18)),
   ('ad', 'ad', 'ADP', (19, 21)),
   ('gratissimas', 'gratissimus', 'ADJ', (22, 33)),
   ('tuas', 'tuus', 'DET', (34, 38)),
   ('litteras', 'littera', 'NOUN', (39, 47)),
   ('respondissem', 'respondeo', 'VERB', (48, 60)),
   (',', ',', 'PUNCT', (60, 61)),
   ('nisi', 'nisi', 'SCONJ', (62, 66)),
   ('nouum', 'nouus', 'ADJ', (67, 72)),
   ('a', 'ab', 'ADP', (73, 74)),
   ('Te', 'tu', 'PRON', (75, 77)),
   ('beneficium', 'beneficium', 'NOUN', (78, 88)),
   ('in', 'in', 'ADP', (89, 91)),
   ('me', 'ego', 'PRON', (92, 94)),
   ('redundasset', 'redundo', 'VERB', (95, 106)),
   ('.', '.', 'PUNCT', (106, 107))]),
 ('Academiae nempe scientiarum Parisinae placuit me immeritum iis adscribere, quibus cum ipsi est litterarum commercium.',
  [('Academiae', 'Academia', '
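The records loaded above pair each sentence with its tokens, and every token carries `(start, end)` character offsets relative to the start of its sentence. The sketch below illustrates that layout with a hand-written record (no model involved; the lemmata and tags follow the sample output above, with offsets recomputed for the shortened sentence) and checks that slicing the sentence text with the stored offsets recovers each token. `check_offsets` is a hypothetical helper.

```python
def check_offsets(sent_text, sent_tokens):
    """Return True if every (start, end) span slices its token out of the sentence."""
    return all(sent_text[start:end] == token
               for token, _lemma, _pos, (start, end) in sent_tokens)

# Hand-written record in the same layout:
# (sentence text, [(token, lemma, POS, (start, end)), ...])
record = (
    "nouum a Te beneficium.",
    [
        ("nouum", "nouus", "ADJ", (0, 5)),
        ("a", "ab", "ADP", (6, 7)),
        ("Te", "tu", "PRON", (8, 10)),
        ("beneficium", "beneficium", "NOUN", (11, 21)),
        (".", ".", "PUNCT", (21, 22)),
    ],
)
print(check_offsets(*record))
```

A check of this kind is a cheap way to confirm that the offsets still line up after any change to the cleaning or splitting code.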

In [15]:
source_pickles = "/srv/data/tome/noscemus/sents_data/"
target_jsons = "/srv/data/tome/noscemus/sents_data_jsons/"
os.makedirs(target_jsons, exist_ok=True)  # create the output folder if it does not exist yet

In [16]:
for fn in os.listdir(source_pickles):
    doc_id = fn.rpartition(".")[0]  # file name without the ".pickle" extension
    with open(source_pickles + fn, "rb") as f:
        sents_data = pickle.load(f)
    sents_data_updated = []
    for sent_n, (sent_text, sent_data) in enumerate(sents_data):
        sents_data_updated.append((doc_id, sent_n, sent_text, sent_data))
    with open(target_jsons + doc_id + ".json", "w") as f:
        json.dump(sents_data_updated, f)
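One detail worth keeping in mind about the conversion above: JSON has no tuple type, so the tuples stored in the pickles come back as lists when a JSON file is loaded. The sketch below demonstrates this with the standard library only, using a hand-written row in the `(doc_id, sent_n, sent_text, sent_data)` layout; the document ID, sentence and shortened token list are illustrative.

```python
import json

# Illustrative row; the token list is shortened to two entries.
row = ("801742", 0, "nouum a Te beneficium.",
       [("nouum", "nouus", "ADJ", (0, 5)), ("a", "ab", "ADP", (6, 7))])

restored = json.loads(json.dumps([row]))[0]
print(type(restored).__name__, type(restored[3][0]).__name__)

# Convert back to the tuple layout if downstream code relies on it.
doc_id, sent_n, sent_text, sent_data = restored
sent_data = [(tok, lemma, pos, tuple(span)) for tok, lemma, pos, span in sent_data]
print(sent_data[0])
```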

In [35]:
target_path = "/srv/data/tome/noscemus/sents_data_lower/"
os.makedirs(target_path, exist_ok=True)  # create the output folder if it does not exist yet

In [36]:
ready = os.listdir(target_path)
len(ready)

684

In [37]:
%%time
for n, filename in enumerate(filenames_list):
    # skip files that were already processed in an earlier (interrupted) run
    if filename.replace(".txt", ".pickle") not in ready:
        doc_id = filename.rpartition(".txt")[0]  # avoid shadowing the built-in id()
        with open(source_path + filename, "r", encoding="utf-8") as f:
            rawtext = f.read()
        doc = from_rawtext_to_doc(rawtext)
        # per sentence: (sentence text, [(token, lemma, POS, (start, end) offsets within the sentence)])
        doc_sentdata = [(sent.text, [(t.text, t.lemma_, t.pos_, (t.idx - sent[0].idx, t.idx - sent[0].idx + len(t))) for t in sent]) for sent in doc.sents]
        with open(target_path + doc_id + ".pickle", "wb") as f:
            pickle.dump(doc_sentdata, f)
        if n % 50 == 0:  # print the progress every 50 files
            print(n)

700
750
800
850
900
950
1000
CPU times: user 48min 52s, sys: 2min 26s, total: 51min 18s
Wall time: 51min 19s
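The loop above resumes an interrupted run by skipping every file whose pickle already exists. A minimal sketch of the same idea follows, using a temporary directory and placeholder file names (all hypothetical) and holding the finished names in a `set` for constant-time lookups.

```python
import os
import tempfile

def pending_files(filenames, target_dir):
    """Return the .txt names that have no matching .pickle in target_dir (hypothetical helper)."""
    done = set(os.listdir(target_dir))
    return [fn for fn in filenames if fn.replace(".txt", ".pickle") not in done]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend one file was already processed in an earlier run.
    open(os.path.join(tmp, "1.pickle"), "wb").close()
    todo = pending_files(["1.txt", "2.txt", "3.txt"], tmp)
    print(todo)
```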


In [88]:
source_path = target_path
target_path = "/srv/data/tome/noscemus/lemmatized_sents/"
os.makedirs(target_path, exist_ok=True)  # create the output folder if it does not exist yet

In [89]:
fns = os.listdir(source_path)
fns[:10]

['725075.pickle',
 '928138.pickle',
 '985903.pickle',
 '733505.pickle',
 '739101.pickle',
 '702145.pickle',
 '906214.pickle',
 '902259.pickle',
 '901017.pickle',
 '904418.pickle']

In [91]:
for fn in fns:
    lemmatized_sents = []
    with open(source_path + fn, "rb") as f:  # close the file handle after loading
        sents_data = pickle.load(f)
    for (sent_text, sent_data) in sents_data:
        lemmasent = []
        for wordform, lemma, tag, position in sent_data:
            if tag in ["NOUN", "PROPN", "ADJ", "VERB"]:
                lemmasent.append(lemma)
        lemmatized_sents.append(" ".join(lemmasent) + "\n")
    with open(target_path + fn.replace(".pickle", ".txt"), "w", encoding="utf-8") as f:
        f.writelines(lemmatized_sents)
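The filtering step above can be checked on a small hand-written example. The sketch keeps only the lemmata of nouns, proper nouns, adjectives and verbs and joins them into one line per sentence; the input rows follow the `(token, lemma, POS, span)` layout used in the pickles, and `lemma_line` is a hypothetical helper.

```python
CONTENT_POS = {"NOUN", "PROPN", "ADJ", "VERB"}

def lemma_line(sent_data, keep=CONTENT_POS):
    """Join the lemmata of the tokens whose POS tag is in keep (hypothetical helper)."""
    return " ".join(lemma for _token, lemma, pos, _span in sent_data if pos in keep)

sent_data = [
    ("nouum", "nouus", "ADJ", (0, 5)),
    ("a", "ab", "ADP", (6, 7)),
    ("Te", "tu", "PRON", (8, 10)),
    ("beneficium", "beneficium", "NOUN", (11, 21)),
    (".", ".", "PUNCT", (21, 22)),
]
print(lemma_line(sent_data))
```

A sentence without any content words yields an empty string, which is why the output files can contain blank lines.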