Coptic SCRIPTORIUM Tokenization Script

This Perl script tokenizes a running text of Sahidic Coptic word forms into their constituent morphemes, ideally compatible with the Coptic SCRIPTORIUM part-of-speech tagging guidelines. The script expects space separated word forms in UTF-8, e.g. (the UTF-8 equivalent of):

afsotp nqiphllo

Is tokenized into

a
f
sotp
nqi
p
hllo

In order to retain the original word form border, the script optionally outputs the whole word as well in an SGML tag aruond each group of tokens. The script ignores inline SGML tags, so you can tokenize files that already include tags, even within words. It is therefore possible to tokenize the UTF-8 equivalent of teishime ('this woman' with an annotated initial tao).

Usage: tokenize_coptic.pl [options]

Options and argument:

-h print this [h]elp message and quit
-d specify [d]ictionary file, default is copt_lex.tab in script directory
-s specify [s]egmentation file, default is segmentation_table.tab in script directory
-m specify [m]orph table file, default is morph_table.tab in script directory
-p output [p]ipe separated word forms instead of tokens in separate lines wrapped by tags
-l add [l]ine tags marking original linebreaks in input file
-n [n]o output of word forms in <norm_group> elements before the set of tokens extracted from each group

Examples:

Tokenize a Coptic plain text file in UTF-8 encoding (without BOM):

tokenize_coptic.pl in_Coptic_utf8.txt > out_Coptic_tokenized.txt

v2.X.X adds the -l parameter, which allows tokenization of text that breaks across lines (as in a diplomatic transcription of a manuscript)
v3.X.X adds a file with most frequently used previous segmentations called segmentation_table.tab. This file should be in the same directory as the tokenizer and the lexicon file, copt_lex.tab.
v4.X.X adds morphological analysis and a corresponding file with known morphological segmentations, morph_table.tab.

This program is free software. You may copy or redistribute it under the same terms as Perl itself.

This script relies on a lexicon derived from materials kindly provided by Prof. Tito Orlandi of the CMCL project. Please cite CMCL whenever using this script for your publications.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
CondMerge.xla		CondMerge.xla
CondMerge_Mac.xla		CondMerge_Mac.xla
README.md		README.md
copt_lex.tab		copt_lex.tab
example_in.txt		example_in.txt
example_out.txt		example_out.txt
example_sgml.txt		example_sgml.txt
example_sgml_lines.txt		example_sgml_lines.txt
get_latest_segmentations.py		get_latest_segmentations.py
morph_table.tab		morph_table.tab
segmentation_table.tab		segmentation_table.tab
tokenize_coptic.pl		tokenize_coptic.pl
~$segmentation_table.xlsx		~$segmentation_table.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CondMerge.xla

CondMerge.xla

CondMerge_Mac.xla

CondMerge_Mac.xla

README.md

README.md

copt_lex.tab

copt_lex.tab

example_in.txt

example_in.txt

example_out.txt

example_out.txt

example_sgml.txt

example_sgml.txt

example_sgml_lines.txt

example_sgml_lines.txt

get_latest_segmentations.py

get_latest_segmentations.py

morph_table.tab

morph_table.tab

segmentation_table.tab

segmentation_table.tab

tokenize_coptic.pl

tokenize_coptic.pl

~$segmentation_table.xlsx

~$segmentation_table.xlsx

Repository files navigation

Coptic SCRIPTORIUM Tokenization Script

About

Releases 10

Packages

Contributors 3

Languages

CopticScriptorium/tokenizers

Folders and files

Latest commit

History

Repository files navigation

Coptic SCRIPTORIUM Tokenization Script

About

Resources

Stars

Watchers

Forks

Languages