# CLTK on treebanks
This notebook, run in a virtual environment on Python 3.9.7, applies POS tagging with the CLTK tagger to the treebank test data.

In [1]:
from platform import python_version

print(python_version())

3.9.7


In [2]:
from cltk import NLP

2023-11-06 12:15:38.310007: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


The snippet below is taken from the [CLTK demonstration notebook](https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb).

In [3]:
# Load the default Pipeline for Latin
cltk_nlp = NLP(language="lat")
# Remove ``LatinLexiconProcess`` (the last process in the pipeline) because it is slow (adds ~9 mins total)
cltk_nlp.pipeline.processes.pop(-1)
# print(cltk_nlp.pipeline.processes)

‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.


cltk.lexicon.processes.LatinLexiconProcess
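
`pop(-1)` relies on `LatinLexiconProcess` being the last entry in the pipeline. A minimal sketch of removing a process by name instead is shown here. The helper `drop_process` is hypothetical (not part of CLTK), and the two classes are stand-ins so the example runs without CLTK installed. It assumes the pipeline entries are classes, as the output above suggests.

```python
def drop_process(processes, name):
    """Return a new list without any process class whose __name__ equals `name`."""
    return [proc for proc in processes if proc.__name__ != name]


# Stand-in classes (assumption: the real pipeline holds classes with these names).
class LatinStanzaProcess:
    pass


class LatinLexiconProcess:
    pass


processes = [LatinStanzaProcess, LatinLexiconProcess]
remaining = drop_process(processes, "LatinLexiconProcess")
print([proc.__name__ for proc in remaining])
```

In this notebook the call would presumably look like `cltk_nlp.pipeline.processes = drop_process(cltk_nlp.pipeline.processes, "LatinLexiconProcess")`, though I have not run that against CLTK 1.1.6.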

### ITTB

In [4]:
# open the ittb file that contains only the tokens of the test set
with open("../random-training-data-other-taggers/test_tok_ittb.txt") as ittb:
    # read and strip the lines
    content = ittb.read().strip()
# and analyze the content with cltk
cltk_doc = cltk_nlp.analyze(text=content)

# get all the tokens in a list
toks = cltk_doc.tokens
# get all the pos tags in a list
pos_tags = cltk_doc.pos

# Write one "token<TAB>POS tag" row per token to the output file
with open("../random-training-data-other-taggers/test_sets/cltk_ittb.txt", "w") as out:
    for token, tag in zip(toks, pos_tags):
        # End every row with a newline character
        out.write(f"{token}\t{tag}\n")

Unrecognized UD `feature_name` ('ConjType') with `feature_value` ('Expl').
Please raise an issue at <https://github.com/cltk/cltk/issues> and include a small sample to reproduce the error.
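
The output rows are only useful for evaluation if the tagger returns the same tokens that went in. Whether CLTK re-tokenizes this input is not verified here, so the following is a hedged sketch of a sanity check. The helper `first_mismatch` is hypothetical and the tokens are made up for illustration.

```python
def first_mismatch(expected, produced):
    """Return the index of the first differing position, or None if the lists match."""
    for i, (exp, got) in enumerate(zip(expected, produced)):
        if exp != got:
            return i
    if len(expected) != len(produced):
        # One list is a prefix of the other: they diverge where the shorter one ends.
        return min(len(expected), len(produced))
    return None


# Made-up tokens for illustration only.
gold = ["arma", "virumque", "cano"]
same = ["arma", "virumque", "cano"]
resplit = ["arma", "virum", "que", "cano"]

print(first_mismatch(gold, same))     # None
print(first_mismatch(gold, resplit))  # 1
```

Assuming the input file holds whitespace-separated tokens, the check in this notebook would be `first_mismatch(content.split(), toks)`.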


### LLCT

In [5]:
with open("../random-training-data-other-taggers/test_tok_llct.txt") as llct:
    content2 = llct.read().strip()
cltk_doc2 = cltk_nlp.analyze(text=content2)

toks2 = cltk_doc2.tokens
pos_tags2 = cltk_doc2.pos

# Write one "token<TAB>POS tag" row per token to the output file
with open("../random-training-data-other-taggers/test_sets/cltk_llct.txt", "w") as out2:
    for token, tag in zip(toks2, pos_tags2):
        out2.write(f"{token}\t{tag}\n")

### UDante

In [6]:
with open("../random-training-data-other-taggers/test_tok_udante.txt") as udante:
    content3 = udante.read().strip()
cltk_doc3 = cltk_nlp.analyze(text=content3)

toks3 = cltk_doc3.tokens
pos_tags3 = cltk_doc3.pos

# Write one "token<TAB>POS tag" row per token to the output file
with open("../random-training-data-other-taggers/test_sets/cltk_udante.txt", "w") as out3:
    for token, tag in zip(toks3, pos_tags3):
        out3.write(f"{token}\t{tag}\n")

### Perseus

In [7]:
with open("../random-training-data-other-taggers/test_tok_perseus.txt") as perseus:
    content5 = perseus.read().strip()
cltk_doc5 = cltk_nlp.analyze(text=content5)

toks5 = cltk_doc5.tokens
pos_tags5 = cltk_doc5.pos
        
# Write one "token<TAB>POS tag" row per token to the output file
with open("../random-training-data-other-taggers/test_sets/cltk_perseus.txt", "w") as out5:
    for token, tag in zip(toks5, pos_tags5):
        out5.write(f"{token}\t{tag}\n")
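
The ITTB, LLCT, UDante and Perseus cells repeat the same read, analyze and write steps. A minimal sketch of folding them into one function is shown here. `tag_file` and `write_tagged` are hypothetical helpers, and `fake_analyze` is a stand-in for `cltk_nlp.analyze` so the example runs without CLTK. It only assumes that the returned document exposes `tokens` and `pos`, as in the cells above.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def write_tagged(tokens, pos_tags, out_path):
    """Write one 'token<TAB>tag' row per token."""
    with open(out_path, "w", encoding="utf-8") as out:
        for token, tag in zip(tokens, pos_tags):
            out.write(f"{token}\t{tag}\n")


def tag_file(analyze, in_path, out_path):
    """Read in_path, tag its content with `analyze`, and write the rows to out_path."""
    content = Path(in_path).read_text(encoding="utf-8").strip()
    doc = analyze(text=content)
    write_tagged(doc.tokens, doc.pos, out_path)
    return doc


def fake_analyze(text):
    """Stand-in tagger: splits on whitespace and tags every token as 'X'."""
    tokens = text.split()
    return SimpleNamespace(tokens=tokens, pos=["X"] * len(tokens))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "tokens.txt"
    dst = Path(tmp) / "tagged.txt"
    src.write_text("arma\nvirumque\ncano\n", encoding="utf-8")
    tag_file(fake_analyze, src, dst)
    result = dst.read_text(encoding="utf-8")

print(result)
```

With the real tagger, each treebank cell would reduce to one call such as `tag_file(cltk_nlp.analyze, input_path, output_path)`. The explicit `encoding="utf-8"` is my assumption about the files, not something the notebook states.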

### PROIEL

CLTK has problems tagging the PROIEL data in one go. To keep the kernel from dying during tagging, I used a different approach for this treebank and analyzed the file token by token.

In [25]:
# read the file
with open("../random-training-data-other-taggers/test_tok_proiel.txt") as proiel:
    content4 = proiel.readlines()


In [26]:
# Go through the list and analyze each token separately
analyzed = []
for tok in content4:
    analyzed.append(cltk_nlp.analyze(text=tok))

In [27]:
# Collect the token list and the POS-tag list of every analyzed document
toks4 = [el.tokens for el in analyzed]
pos_tags4 = [el.pos for el in analyzed]

In [28]:
# Write the first token and the first POS tag of every analyzed document to the output file
with open("../random-training-data-other-taggers/test_sets/cltk_proiel.txt", "w") as out4:
    for tokens, tags in zip(toks4, pos_tags4):
        out4.write(f"{tokens[0]}\t{tags[0]}\n")
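
Analyzing one token at a time gives the tagger no surrounding context. An alternative I have not tested is to split the token list into moderately sized chunks and analyze each chunk in one call. The sketch below only shows the chunking step. `chunked` is a hypothetical helper, and whether any particular chunk size avoids the kernel crash is an open question.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up data for illustration only.
batches = list(chunked(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

In this notebook, each chunk of stripped lines from `content4` could be joined back into one string and passed to `cltk_nlp.analyze`, with the resulting `tokens` and `pos` lists concatenated across chunks.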