TreeTagger part-of-speech tagging models for Sahidic Coptic

Version: 1.10 (includes POS tagging and lemmatization, with DDGLC Greek lemma information)
Model source file: coptic_fine10.par / coptic_coarse10.par

The part-of-speech tagging models are for use with the freely available TreeTagger. The models are based on the guidelines of the Coptic SCRIPTORIUM project, which closely follow Layton's (2011) grammar. The lexicon used by the tagger is based on a lexicon kindly provided by Prof. Tito Orlandi and the CMCL project, and on a lemma list provided by Prof. Tonio Sebastian Richter and the DDGLC project. Please cite the CMCL and DDGLC projects whenever publishing research using the tagging models.

There are two different models: one for the coarse-grained tagset, with 22 tags, and one for the fine-grained tagset, which distinguishes 44 tags (including individual tags for each positive and negative conjugation base). For details on the tagset, see the documentation on the Coptic SCRIPTORIUM web page.

To use the models, download and unzip the TreeTagger. In the folder bin/ you will find the TreeTagger executable, which requires one of the two parameter files to run. TreeTagger also expects an input file in a one-token-per-line format. For example, the input file input.txt could include the following tokens (the file must be encoded in UTF-8; the ASCII characters below are for illustration purposes only):

noute
pe

These will be tagged as:

noute	N
pe	COP
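
The one-token-per-line input and the tab-separated output can also be handled from a script. The following is a minimal Python sketch, not part of the TreeTagger distribution; the helper names write_tokens and read_tagged are hypothetical, and it assumes the output file has the token in the first column and the tag in the second, as in the example above.

```python
def write_tokens(tokens, path):
    """Write one token per line, encoded as UTF-8 (hypothetical helper)."""
    # newline="\n" keeps line endings the same on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for token in tokens:
            handle.write(f"{token}\n")


def read_tagged(path):
    """Read tab-separated tagger output into (token, tag) pairs.

    Assumes column 1 is the token and column 2 is the tag; any further
    columns (for example a lemma) are ignored here.
    """
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip empty lines
            token, tag = line.split("\t")[:2]
            pairs.append((token, tag))
    return pairs
```

For instance, write_tokens(["noute", "pe"], "input.txt") would produce the input file described above, using the illustrative ASCII tokens in place of real Coptic text.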

To run the tagger, run the TreeTagger executable as follows (Windows example):

tree-tagger.exe coptic_fine.par -token input.txt output.txt

Or to include lemmas in a third column in the output use:

tree-tagger.exe coptic_fine.par -token -lemma input.txt output.txt
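
The same invocations can be scripted. The sketch below is only an illustration in Python: build_command and run_tagger are hypothetical helper names, and it assumes the executable and the parameter file are reachable from the current working directory under the names used in the commands above.

```python
import subprocess


def build_command(par_file, input_path, output_path,
                  with_lemma=False, exe="tree-tagger.exe"):
    """Assemble the argument list in the same order as the commands above."""
    cmd = [exe, par_file, "-token"]
    if with_lemma:
        cmd.append("-lemma")  # adds the lemma as a third output column
    cmd += [input_path, output_path]
    return cmd


def run_tagger(par_file, input_path, output_path, with_lemma=False,
               exe="tree-tagger.exe"):
    """Run the tagger and raise CalledProcessError if it exits with an error."""
    cmd = build_command(par_file, input_path, output_path, with_lemma, exe)
    subprocess.run(cmd, check=True)


# Example call (requires TreeTagger and the parameter file to be present):
# run_tagger("coptic_fine.par", "input.txt", "output.txt", with_lemma=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file paths contain spaces.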

The option -token tells the TreeTagger to print each input token next to its tag in the output; the input itself must always be tokenized, with one token per line. For a Coptic tokenizer, see the Coptic SCRIPTORIUM project web page. Further options, such as allowing for SGML tags in the input (-sgml) or outputting the word form as a lemma when the lemma is unknown (-no-unknown), are documented in the TreeTagger documentation. For the coarse-grained tags use coptic_coarse.par instead of coptic_fine.par.
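
When the lemma column is requested, the fallback to the word form can also be applied after tagging. The Python sketch below assumes three tab-separated columns (token, tag, lemma) and assumes that an unknown lemma appears as the string <unknown>, which is the TreeTagger default as far as we are aware; parse_line is a hypothetical helper name.

```python
UNKNOWN_LEMMA = "<unknown>"  # assumed placeholder for a missing lemma


def parse_line(line):
    """Split one output line into (token, tag, lemma).

    Falls back to the token itself when the lemma column is absent
    or holds the unknown-lemma placeholder.
    """
    fields = line.rstrip("\r\n").split("\t")
    token, tag = fields[0], fields[1]
    lemma = fields[2] if len(fields) > 2 else UNKNOWN_LEMMA
    if lemma == UNKNOWN_LEMMA:
        lemma = token
    return token, tag, lemma
```

This mirrors what the tagger does when asked to output the word form for unknown lemmas, but keeps the raw output file unchanged so that unknown items can still be counted.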