Coptic SCRIPTORIUM Tokenization Script

This Perl script tokenizes running text of Sahidic Coptic word forms into their constituent morphemes, ideally in a way compatible with the Coptic SCRIPTORIUM part-of-speech tagging guidelines. The script expects space-separated word forms in UTF-8, e.g. (the UTF-8 equivalent of):

afsotp nqiphllo

Is tokenized into


To retain the original word form borders, the script optionally outputs the whole word as well, in an SGML tag around each group of tokens. The script ignores inline SGML tags, so you can tokenize files that already include tags, even within words. It is therefore possible to tokenize the UTF-8 equivalent of teishime ('this woman' with an annotated initial tao).
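As a rough illustration of what "ignoring inline SGML tags" can mean, here is a minimal Python sketch. It is not the script's Perl implementation; the function names and the `<hi rend="large">` tag are assumptions made up for the example. It splits the input on whitespace that lies outside of tags, then derives the bare word form by removing the tags.

```python
import re

# Anything between '<' and the next '>' is treated as an inline SGML tag.
TAG = re.compile(r"<[^>]*>")
# A word form is a run of tags and non-space characters; whitespace
# inside a tag (e.g. before an attribute) does not end the word form.
WORD_FORM = re.compile(r"(?:<[^>]*>|[^\s<])+")


def strip_inline_tags(word_form):
    """Return the word form with all inline tags removed."""
    return TAG.sub("", word_form)


def split_word_forms(text):
    """Split text into word forms, keeping inline tags attached to them.

    Chunks that consist only of tags (no text) are dropped.
    """
    return [m for m in WORD_FORM.findall(text) if strip_inline_tags(m)]


if __name__ == "__main__":
    # Hypothetical annotated input: the initial letter carries a tag.
    sample = '<hi rend="large">t</hi>eishime afsotp'
    for raw in split_word_forms(sample):
        print(raw, "->", strip_inline_tags(raw))
```

A tokenizer built this way would analyse the bare form and then map the token boundaries back onto the tagged original; that mapping step is left out of the sketch.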

Usage: [options]

Options and arguments:

  • -h print this [h]elp message and quit
  • -d specify [d]ictionary file, default is in script directory
  • -s specify [s]egmentation file, default is in script directory
  • -m specify [m]orph table file, default is in script directory
  • -p output [p]ipe separated word forms instead of tokens in separate lines wrapped by tags
  • -l add [l]ine tags marking original linebreaks in input file
  • -n [n]o output of word forms in <norm_group> elements before the set of tokens extracted from each group
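The difference between the output modes selected by -p and -n can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the script's actual output code: the function name `format_group`, the `norm_group` attribute name, and the segmentation a|f|sotp are all assumptions chosen for the example.

```python
def format_group(word_form, tokens, pipes=False, groups=True):
    """Render the tokens of one word form in one of three assumed layouts.

    pipes=True   roughly mirrors -p: one line, tokens joined by '|'.
    groups=False roughly mirrors -n: one token per line, no wrapper.
    Default: one token per line inside a <norm_group> element that
    records the original word form (attribute name is an assumption).
    """
    if pipes:
        return "|".join(tokens)
    lines = list(tokens)
    if groups:
        lines.insert(0, '<norm_group norm_group="%s">' % word_form)
        lines.append("</norm_group>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative segmentation only; supplied by hand, not computed here.
    tokens = ["a", "f", "sotp"]
    print(format_group("afsotp", tokens))
    print(format_group("afsotp", tokens, groups=False))
    print(format_group("afsotp", tokens, pipes=True))
```

The pipe-separated form keeps one word form per unit and is convenient for manual correction, while the tagged form preserves the word form border in markup.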


Tokenize a Coptic plain text file in UTF-8 encoding (without BOM): in_Coptic_utf8.txt > out_Coptic_tokenized.txt

  • v2.X.X adds the -l parameter, which allows tokenization of text that breaks across lines (as in a diplomatic transcription of a manuscript)
  • v3.X.X adds a file with the most frequently used previous segmentations. This file should be in the same directory as the tokenizer and the lexicon file, unless another location is given with -s
  • v4.X.X adds morphological analysis and a corresponding file with known morphological segmentations (see -m)
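One way to picture how a file of previous segmentations can be used is as a lookup table that is consulted before any rule-based analysis. The Python sketch below is only a guess at the general idea: the tab-separated file layout, the `|` separator, and the names `load_segmentations` and `segment` are assumptions, not the script's actual format or code.

```python
def load_segmentations(lines):
    """Build a dict from lines of the assumed form 'word<TAB>seg|men|ts'.

    Blank lines and lines without a tab are skipped.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue
        word, seg = line.split("\t", 1)
        table[word] = seg.split("|")
    return table


def segment(word, table, fallback):
    """Return the stored segmentation if known, else ask the fallback."""
    if word in table:
        return table[word]
    return fallback(word)


if __name__ == "__main__":
    # Hypothetical table content; a real file would be read from disk.
    table = load_segmentations(["afsotp\ta|f|sotp"])
    leave_whole = lambda w: [w]  # stand-in for a real analyser
    print(segment("afsotp", table, leave_whole))
    print(segment("unknownform", table, leave_whole))
```

Checking known segmentations first means frequent word forms are resolved consistently, and the slower or less certain analysis only runs for unseen forms.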

Copyright 2013-2016, Amir Zeldes

This program is free software. You may copy or redistribute it under the same terms as Perl itself.

This script relies on a lexicon derived from materials kindly provided by Prof. Tito Orlandi of the CMCL project. Please cite CMCL whenever using this script for your publications.