TM Editor Guidelines

1. Instructions for Aligning TMs from Pre-Segmented Text Files Using InterText:

To segment the texts we are using a convenient open source application called InterText. You may download InterText here; it is very lightweight and can be run locally on Windows, Mac OS, or Linux.

I have made the following screencast that should show you everything you need to know:

Additionally, there is an online PDF guide for using InterText. The guide contains a lot of documentation that is not necessary to read through; to simply work from the app, you should only need to read part II, chapters 7–9, here.

For 84000’s TM project, we will provide you with the two .txt files that you will be aligning from InterText. The two texts have been prepared with a script (for anyone applying this methodology to another TM project, documentation for these scripts may be found here).

As mentioned in the tutorial, when you upload the .txt files use the following dialog settings: For the Tibetan, set both "Paragraphs" and "Sentences" to be separated by "line breaks". For the English, set the "Paragraphs" to be separated by "line breaks" and the "Sentences" to be "automatically segment text using profile: default"; then when it asks to automatically align using Hunalign, click "Okay".

Once you have become familiar with the interface, please read through and use the following guidelines while you are editing the alignment of the texts. These standards are summarized in a cheatsheet at the end.

2. TM Standards

The following is a set of recommended standards for segmenting translation memories according to what can loosely be understood as the “sentence” or “complete thought” found in the source Tibetan. The examples here all use English as the target language, although since these standards focus on the Tibetan grammar, it is hoped that this methodology may be adapted to be paired with any target language.

When creating the segmentation, it is expected that there will be some degree of subjectivity on the part of the TM editor in defining the length, start, and end of each segment. After surveying many different scenarios, source genres, and translation styles, we have determined that it is necessary to leave these rules somewhat flexible. If the rules are too rigidly defined, there will inevitably be scenarios where too many of the TM segments will be too long to be of any use, and this is particularly the case with classical Tibetan texts that notoriously use frequent run-on phrases. However, if we can create some general principles and guidelines that can be agreed upon, then our TM resources will be much more useful, both for translators retrieving their own past translations for consistency and recall, and for allowing TMs to be archived and shared between translators.

i General Principle:

A “segment” of the text constitutes what can be understood as one complete thought. In principle, it should be the most basic phrase-level unit of text that can be correctly understood without relying on any grammatical modifiers that may happen to precede or follow the segment. A segment should include at least one primary clause containing (A) a subject, which may be actually stated or just implied, (B) a verb, and (C) any grammatical adjuncts connected with that verb (objects, prepositions, participles, dependent clauses, etc.). A segment may necessarily contain multiple clauses if those clauses are dependent upon another main clause to be correctly read. The segment should thus be the smallest span of text that the translator will conceptually translate as one unit. Here is a simple example:

བྲམ་ཟེའི་ ཁྱེའུ་ ཉེ་ རྒྱལ་ དབེན་པ་ ལ་ དགའ་ བས་ གནས་ མལ་ དགོན་པར་ སོང་ སྟེ །
As the young brahmin Upatiṣya enjoyed solitude, he had gone to live in the forest,

A good way of thinking about segment length is to read the passage out loud and see where one would naturally pause before going on to the next statement. These pauses will likely be where we want to break up the segments. Abstractly, this will be more or less equivalent to an English sentence; however, the segmentation needs to be done from the perspective of the Tibetan grammar itself and not from its English translation. Tibetan only defines a full stop grammatically with the རྫོགས་ཚིག་ or completion particle (e.g., -འོ་, བོ་, སོ་ etc...). Unfortunately these particles are too few and far between to make for useful segmentation, so we need to determine some additional breaking points that will break passages of one or more clauses into suitable segments.

Placing Segment Breaks According to Inflected Verbs:

Because in Tibetan, the typical grammatical order is subject, object, verb, a single clause always ends in an inflected verb (inflected into the past, present, future, etc.). These verbs need to be the focal point for determining when a break should and should not occur between any two clauses. Therefore the process for segmenting the Tibetan generally involves:

1) reading through the text and stopping to inspect each inflected verb,
2) determining if the clause governed by that verb along with all of its own grammatical adjuncts can stand on its own and doesn’t depend on the clause following it to determine its meaning, or vice versa.

If the answer to 2) is yes, then a break should be added after that clause. Note that it will almost always include an additional particle following the inflected verb whether it be a case particle (གིས་, གི་, ནས་, ན་, -ས་, etc…) or non-case particle(s) (ཡང་, སྟེ་, དེ་, ན་ཡང་ etc...) which should also be included in the segment. This final particle will also need to be considered, as it will be an important factor for determining the relationship between the preceding clause and the one following it.

I avoid calling these segments “independent clauses” because the presence of this final particle following the inflected verb would actually, in almost all cases, make the clause a dependent one if the Tibetan particles were forced into the parameters of English grammar. But in principle, any one segment should be able to stand on its own; the segment should include all the essential adverbs, locatives, and other adjuncts that are directly associated with the action of the verb.

Here are a few examples of segments created using this methodology:

དེ་ནས་ བྱང་ཆུབ་ སེམས་དཔའ་ འཕགས་པ་ སྤྱན་རས་ གཟིགས་ དབང་ཕྱུག་ དང་ །_ བྱང་ཆུབ་ སེམས་དཔའ་ ལག་ ན་ རྡོ་རྗེ་ ཏིང་ངེ་འཛིན་ དེ་ལས་ ལངས་ ནས་
Then, the bodhisattvas Noble Avalokiteśvara and Vajrapāṇi emerged from their state of concentration, and

བཅོམ་ལྡན་ འདས་ ག་ལ་བ་ དེར་ སོང་ སྟེ་
went to where the Blessed One was staying.

ཕྱིན་པ་ དང་ །_ ལན་ གསུམ་ དུ་ བསྐོར་བ་ བྱས་ ཏེ
They approached, circumambulated him three times, and

བཅོམ་ལྡན་འདས་ ལ་ འདི་སྐད་ ཅེས་ གསོལ་ ཏོ །_།
said to him,

བཅོམ་ལྡན་འདས་ དེ་བཞིན་ གཤེགས་པ་ རྣམས་ ཀྱི་ ཐབས་མཁས་པ་ དང་སེམས་ ཅན་ རྣམས་ ཡོངས་སུ་ སྨིན་པ ར་ བགྱི་བ་ ནི་ མང་ ངོ་ །_།
“Blessed One, the Tathāgata’s skillful means and methods that bring beings to spiritual maturity are many.

བདེ་བ ར་ གཤེགས་པ་ མང་ ངོ་ །_།
Sugata, they are many indeed.

བཅོམ་ལྡན་འདས་ དེ་བཞིན་ གཤེགས་པ འི་ གཟི་བརྗིད་ དང་ རྫུ་འཕྲུལ་ གྱི་ མཐུ ས་ བྱང་ཆུབ་ སེམས་དཔའ་ སེམས་དཔའ་ ཆེན་པོ་ དང་ །_ ཉན་ཐོས་ ཆེན་པོ་ དང་ །_ ལྷ་ དང་ །_ ཀླུ་ དང་ །_ གནོད་སྦྱིན་ དང་ །_ དྲི་ཟ་ དང་ །_ ལྷ་མ་ ཡིན་ དང་ །_ ནམ་མཁའ་ ལྡིང་ [146b]དང་ །_ མི འམ་ ཅི་ དང་ །_ ལྟོ་འཕྱེ་ ཆེན་པོ་ དང་ །_ རྒྱལ་པོ་ དང་ །_ བློན་པོ་ དང་ །_ བྲམ་ཟེ་ དང་ །_ ཁྱིམ་བདག་ དང་ །_ དགེ་སློང་ དང་ །_ དགེ་སློང་ མ་ དང་ །_ དགེ་བསྙེན་ དང་ །_ དགེ་བསྙེན་མ་ རྣམས་ མང་ དུ་ འདུས་ སོ །_།
Blessed One, the bodhisattva mahāsattvas, the great śrāvakas, gods, nāgas, yakṣas, gandharvas, asuras, garuḍas, [146.b] kinnaras, mahoragas, kings, ministers, brahmins, householders, monks, nuns, and male and female lay vow holders have gathered here in great numbers through the strength of the Tathāgata’s majesty and supernatural powers.”

You may notice with these examples that there are some gray areas when determining when an adjectival clause is considered to be an adjunct to an adjacent clause. In such cases you should use your own judgment as to when a clause should be split. In such consideration the length of the clause should also be taken into consideration; an exceedingly long TM segment will be less useful as a resource. However, when a long list of nouns is the subject or object of a verb as in the last segment in the example above, then these need to be joined with their governing verb for the segment to be correctly understood, even if it is quite long.

Note that in almost all cases any final particles immediately following the last inflected verb should be included at the end of the segment. Also, if the segment ends on a double shad, “།_།”, the break should always occur after the second and final shad, as seen in the examples above. This follows the Tibetan rules of grammar, and we should never see a shad occurring at the beginning of a segment.
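The rule that a segment should never begin with a shad is easy to check mechanically after editing. The following is a minimal Python sketch of such a check, not part of the project's tooling; the function names are hypothetical, and it assumes segments are plain strings in which "_" stands for the space between shads, as in the examples above.

```python
SHAD = "།"  # Tibetan shad (U+0F0D)

def starts_with_shad(segment: str) -> bool:
    """Return True if the first visible character of the segment is a shad."""
    # Ignore leading spaces and the "_" placeholder used in the examples.
    return segment.lstrip(" _").startswith(SHAD)

def find_leading_shads(segments):
    """Return the indexes of segments that wrongly begin with a shad."""
    return [i for i, seg in enumerate(segments) if starts_with_shad(seg)]
```

Any index returned points to a segment whose opening shad most likely belongs at the end of the segment before it.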

Segmenting from the Perspective of the Tibetan’s Own Grammar:

As mentioned, the segmentation of the TMs should be governed by the Tibetan grammar rather than the punctuation and grammar found in its English translation. This is because the TMs will be recalled in future translation projects and we want to make them universally applicable to any new translation project. If the Tibetan is segmented like this in a consistent way, then when a new text to be translated on a CAT platform like OmegaT is run through the script and segmented following this same methodology, the TM data will yield the optimal amount of matches.

Therefore, the process of editing the segments should disregard the periods, phrasing or other grammatical aspects found in the English translation. Of course, the English translation will often naturally match up with the Tibetan segmentation, but we want to avoid taking too many cues from the English when determining when and when not to break. That being said, reading the English will be helpful for understanding the Tibetan text and finding where its own “complete thoughts” start and end. Especially if your Tibetan reading skills are still developing, then following the Tibetan by reading along with the English will be essential for the TM editing process.

Giving preference to the Tibetan grammar does present an obvious challenge when the English translation has compounded two Tibetan segments together and intermingled the words in the English, which would prevent you from being able to make a clean break. In this situation there is a simple solution involving duplicating and bracketing the English; it is described in detail along with some examples in section 2.iii.b below, on editing the English segmentation.

ii Editing the Tibetan Segmentation:

As mentioned, the Tibetan and English .txt files that you will be aligning will first be pre-segmented with a script. For the Tibetan, a script is used that first word-tokenizes all the words and particles according to parts of speech and then creates segments with single line breaks after certain inflected verbs, depending on the types of particles following them.

Ideally we want the script to do 50-70% of the work in terms of creating the breaks. A few of these breaks will need to be corrected and merged again either because it erroneously tokenized a verb or particle, or it broke at a conjunction that really needs to be joined with its following clause to be understood as a complete thought.

Then a few additional breaks will need to be added for conjunctions that are not broken by default, but do in fact mark the boundary of a complete thought.

Pre-segmentation Performed by the Pybo-Script:

We will continue to update the script as we go, but generally the script will identify the inflected verbs (i.e., not the nominalized ones containing the markers +པ་/བ་/པར་ etc...), and create a break under the following conditions:

A segment break will be made after an inflected verb followed by a completion particle (རྫོགས་ཚིག) followed by a shad ། ending.

ཐམས་ཅད་ ཀྱང་ ཆོས་ ཉན་པ ར་ འདོད་པ ར་ གྱུར་ ཏོ །_།

A segment break will be made after an inflected verb followed by a source case ནས་ or ལས་ particle.

དེ་ནས་ དེ འི་ ཚེ་ ན་ ས་ ཆེ ར་ གཡོས་པ ར་ གྱུར་ ནས་

A segment break will be made after an inflected verb followed by a continuative ལྷག་བཅས་ particle, སྟེ་, ཏེ་, or དེ་, (note that the last one may be a demonstrative pronoun that has been misidentified by the script).

དེ་ནས་ བཅོམ་ལྡན་འདས་ སེང་གེ འི་ ཁྲི་ དེ་ཉིད་ ལ་ བཞུགས་ ཏེ་

A segment break will be made after an inflected verb followed by a ན་ case particle that will typically mark a conditional/temporal clause.

མི་ ཅིག་ ནམ་མཁའ་ ལ་ མེ་ཏོག་ གཏོར་ ན་

A segment break will be made after an inflected verb that has a ། but no other particle following it.

ཡོངས་སུ་ མྱ་ངན་ ལས་ འདའ་བ་ ཡང་ སྟོན །_
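The five break conditions above can be sketched in Python as follows. This is only an illustration of the rules as described here, not the project's actual script: the (text, tag) token format, the "VERB" tag, and the function names are all assumptions, and the sketch presumes that the tokenizer has already given nominalized verbs some other tag.

```python
# Illustrative sketch only; token format and tag names are assumptions.
COMPLETION = {"གོ", "ངོ", "དོ", "ནོ", "བོ", "མོ", "འོ", "རོ", "ལོ", "སོ", "ཏོ"}
# Source-case, continuative, and ན་ particles that trigger a break.
BREAK_PARTICLES = {"ནས", "ལས", "སྟེ", "ཏེ", "དེ", "ན"}

def strip_tsheg(text):
    """Remove a trailing tsheg so particles can be compared."""
    return text.rstrip("་")

def is_shad(text):
    """True for a shad token such as །, །_ or །_།."""
    return text.startswith("།")

def presegment(tokens):
    """Split a list of (text, tag) tokens into segments (lists of text)."""
    segments, current = [], []
    i, n = 0, len(tokens)
    while i < n:
        text, tag = tokens[i]
        current.append(text)
        end = None  # index of the last token belonging to this segment
        if tag == "VERB" and i + 1 < n:
            nxt = tokens[i + 1][0]
            after_is_shad = i + 2 < n and is_shad(tokens[i + 2][0])
            if strip_tsheg(nxt) in COMPLETION and after_is_shad:
                end = i + 2          # verb + completion particle + shad
            elif strip_tsheg(nxt) in BREAK_PARTICLES:
                # verb + particle, keeping a shad if one follows
                end = i + 2 if after_is_shad else i + 1
            elif is_shad(nxt):
                end = i + 1          # verb + shad, no particle
        if end is not None:
            current.extend(t for t, _ in tokens[i + 1:end + 1])
            segments.append(current)
            current = []
            i = end + 1
        else:
            i += 1
    if current:
        segments.append(current)
    return segments
```

Because the rules look only at the token after the verb, a misidentified verb or particle produces a wrong break, which is why every break still needs to be inspected by hand.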

Where to Merge (Correcting Breaks Made by the Script):

Just because the script makes a break does not mean it should be unequivocally accepted. You should still inspect each inflected verb to see whether that break makes sense. You can use your own judgment and common sense, but let’s examine some of the scenarios where you would want to correct the script by merging a break in the Tibetan.

(Note, to merge two chunks of text into a single TM segment you simply need to place them into the same cell from InterText and they will be merged in the exported TM. You don’t need to merge the units themselves.)

In the following examples an asterisk “*” will represent a break created by the script:

Verb or Particle Misidentified by the Script:

As mentioned, the script may misidentify some words, especially homographs (two words that share the same spelling) or verbs that are actually being used as nouns. Please look out for such occurrences and use your common sense to identify and merge breaks that shouldn’t be there:

ཡང་དག་པར་ ལྟ་ ནས་*
From correct view...

The script identified ལྟ་ as the verb, “to look,” but it is actually the noun, “view” (as in someone’s conceptual or ideological perspective).

After ལས་/ ནས་ Particle that Marks Simultaneous Actions, Reason, or Otherwise Connects One Clause to Another in a Significant Way:

Usually when the ablative particle ལས་/ནས་ is placed after an inflected verb, it will be a simple conjunction indicating a sequential clause following it and be translated into English as “and then” or just “and”. Therefore, we can usually break each of these clauses into independent segments. However, sometimes the ནས་ may signify that the actions of the two clauses it joins are simultaneous, which will typically be correlated in the English with a translation like “while”:

རྟ་པས་ རྟ་ བཞོན་ ནས་* མེ་ མདའ་ རྒྱབ་ སོང་།
The horseman fired the gun while riding on the horse.

An incorrect break after inflected verb བཞོན་ ནས་

Also, the ནས་ may indicate a stronger correlation between the preceding and following clauses. For instance, the prior clause may indicate the reason or place of the following one, or for some other reason the prior clause may be a necessary adjunct of the following one. Consider the following segments:

སེམས་ སྡུག་ ནས་* བུ་ ངུས །
Distraught, the boy cried.

བླ་མ་ དྲན་ ནས་* ཁོང་ བཤུམས །
Remembering her guru, she wept.

ལྷ འི་ བུ་ དག་ གང་ སེམས་ཅན་ ལ་ལ ས་ སངས་རྒྱས་ བཅོམ་ལྡན་འདས་ མྱ་ངན་ ལས་ འདས་ ནས་* ལོ་ བརྒྱ་ སྟོང་ ངམ །_ བསྐལ་པ་ བྱེ་བ་ ལོན་ ཡང་
Gods, even though one thousand years, an eon, or even ten million eons may have elapsed since the Bhagavān Buddha entered parinirvāṇa,

In all these cases there will certainly be some gray areas. You should use your own common sense judgment for determining when the prior clause can be understood standing on its own, or needs to be merged as an adjunct to the following one.

After a ལྷག་བཅས་ Particle (སྟེ་, ཏེ་, or དེ་):

Usually the ལྷག་བཅས་ particle will precede a summary, reiteration or elaboration of the preceding clause or sentence. It is usually suitable to break here, and so that is why the script has been set up to break on clauses followed by a ལྷག་བཅས་, but again, use your own judgement for determining cases when the two segments should be merged:

དེ་འོད་དེས་རེག་སྟེ་*བསྐུལ་ནས།
The light touched and inspired him

Here for example the two verbs “རེག་” and “བསྐུལ་” share the direct object and agent. They are also two short actions that are sensibly part of the same complete thought.

After a ན་ Particle:

Although the script will break after a ན་ case particle following an inflected verb, this is one scenario that needs to be confirmed with some careful discernment. Usually the ན་ marks a conditional or temporal clause (“when...” or “if...”), so technically this prior clause is an adjunct that describes the condition, time, or cause of the following clause. However, we are suggesting that a break be placed here because in most cases the conditional or temporal clause can in fact stand on its own and doesn’t require the main clause to be understood. Again, here you should read the clause to yourself and use your common sense to judge whether or not the break should be there. Here are some examples of acceptable breaks performed by the script after ན་ particles:

རྒྱལ་བའི་ ཆོས་ རྣམས་ སྨོན་པར་ བྱེད་པའི་ མདོ་སྡེ་ འདི །_། གཞན་ ལ་ ཡུད་ ཙམ་ གཅིག་ ཅིག་ སྟོན་པར་ བྱེད་ གྱུར་ ན།
“Yet if for just a brief moment, Another teaches others this sūtra, That points to the Dharma of the victorious ones,

དེ་ ནི་ དེ་ བས་ བསོད་ནམས་ འབྲས་བུ་ ཁྱད་ པར་ འཕགས །
The fruit of his merit will exceed the former.

གལ་ཏེ་ ཆར་ འབེབས་པར་ མི་ བྱེད་ ན །
However, if the nāgas do not send rain,

མཛེས་ ཐེབས་པར་ འགྱུར་ རོ །
leprosy will break out.

བཅོམ་ལྡན་འདས་ ཡུལ་ སྤོང་ བྱེད་ ན །
When the Bhagavān wandered in the land of Vṛji,

ལྗོངས་ རྒྱུ་ ཞིང་ གཤེགས་པ་ དང་ ། གྲོང་ སྤྱིལ་བུ་ཅན་ དུ་ བྱོན་ ཏེ །
he arrived at the village of Kuṭigrāmaka and

These pairs of segments can clearly be understood on their own. It is also fine, as in the final segment above, for the subject to be only implied in the second clause, since this is common to most Tibetan clauses and isn’t necessary to understand the action of the clause.

However, in other cases, the two clauses will be more dependently linked and they should be merged together:

གྲུབ་ པར་ གྱུར་ ན །* སངས་རྒྱས་ ཉིད་ ཀྱང་ སྟེར་བར་ བྱེད་ དོ །
Being accomplished, she can grant even the state of awakening.

Here the first clause “གྲུབ་ པར་ གྱུར་ ན །” is not just a condition for the following clause; there is a sense that it is instrumental in the action of “སྟེར་བར་ བྱེད་”, that is, it is the means through which the granting is done. Therefore, the first clause needs to be included with the second one because it describes and modifies how that second action is done. This particular segment is again a gray area, and this call should be made according to your own judgement and with the full context of the passages that precede and follow.

ཡབ་ ཡུམ་ བདག་ ཅག་ བཏང་ ན་* ལེགས་ ཏེ་
Father, Mother, it is excellent that you let us go.

These two clauses need to be joined because the clause before the ན་ is the subject of the implied linking verb in the second clause coming after the ན་.

དེ་དག་ ངས་ ཆོས་ བསྟན་ ན་* ཤེས་ པར་ འགྱུར་ ནས །
They will understand the Dharma taught by me, and

Here the action of the first verb “བསྟན་ ” is the object of the verb “ཤེས་” in the second clause, so these two clauses should be kept merged into a single segment.

Sometimes the conditional statement may be so brief that it should be included in the following clause if it intuitively seems part of the same complete thought. For example:

དེ་ དག་ བྱས་ ན་* མྱ་ ངན་ མེད །
Then later you will have no regrets.

To make such judgement calls, it is good to ask yourself whether the segment is a useful piece of information. Exceedingly short TMs of just a few syllables certainly will not provide a translator with any useful information when recalled, especially here, where the translator has bent the meaning slightly with “དེ་ དག་ བྱས་ ན་ = Then later”. By itself, this doesn’t tell you anything, and it needs to be given a bit more context to be a useful TM.

Short Series of Actions:

Often a series of short actions defined by inflected verbs will be broken by default by the script. These should be joined, especially if they are the subject or object of another verb in another clause:

འཁོར་ གྱི་ དཀྱིལ་འཁོར་ དེ་དག་ ཆོས་ དང་ ལྡན་པ འི་ གཏམ་ གྱིས་ ཚིམ་པ ར་ བྱས །_ *བསྐུལ །_ *གཟེངས་བསྟོད །_ ཡང་དག་ པར་ དགའ་བ ར་ བྱས་ ཏེ །_
he pleased his surrounding retinue with a teaching on Dharma, and encouraged, uplifted, and complimented them.

The three verbs at the end of this segment are follow-ups to the verb “ཚིམ་པ ར་ བྱས །”. They should be merged, especially since they all share the same object, “འཁོར་ གྱི་ དཀྱིལ་འཁོར་ དེ་དག་”, stated in the first clause.

Final རྫོགས་ཚིག་, or Completion particles (-འོ།, མོ།, སོ།, etc...):

Since the རྫོགས་ཚིག་ completion particle is a full stop, this is the one case where we are 95% sure that a segment would end here. However, there are still some cases where you will want to change or readjust the breaks made by the script after the completion particle, for example:

དེ་དག་ ནི་ ཆོས་ མ་ བསྟན་ ན་ ནི་ ཤེས་ པར་ མི་ འགྱུར་ རོ་* སྙམ་ མོ།
He thought, “but they will not understand the Dharma if I do not teach it.”

“སྙམ་ མོ། / He thought,” is too short to be a useful TM, so the break after the first completion particle should be removed. However, if a verb of thought or speech is longer than just two syllables and contains adjoining subjects, adverbs, etc., then it should more likely be set apart as its own segment.

Where to Break (Making Additional Breaks Missed by the Script):

In addition to correcting unwanted breaks made by the script, you should add further breaks according to the guidelines below.

The “$” symbol will be used in all the following examples to show where a manually entered break is needed:

Misidentified Words:

Again, the script will miss some inflected verbs, usually because they were misidentified as nouns or particles. For instance, currently the script will miss the imperative verb form, གྱིས་ of the verb, བགྱིད་ (“to do, perform”) because it will consider it to be the instrumental case particle. We haven’t adjusted the script to account for this because the instrumental particle is more common.

མི་མཆོག་ ལ་ ནི་ མྱུར་ དུ་ བལྟ་བ ར་ གྱིས །_།$
Quickly, behold the supreme person!

The break here (“$”) needs to be added manually.
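Since the script does not catch this case, a simple search can help an editor find candidates for review. The sketch below is a hypothetical helper, not part of the script; it rests on the assumption that a གྱིས་ standing directly before a shad is more likely to be the imperative verb than the instrumental particle, which will not always hold (in verse, for instance), so every hit still needs to be checked by eye.

```python
import re

# གྱིས as its own syllable (preceded by start, space, or tsheg),
# with an optional tsheg and optional spaces before a shad.
# Heuristic assumption: such a གྱིས་ may be the imperative verb.
GYIS_BEFORE_SHAD = re.compile(r"(?:^|[\s་])གྱིས་?\s*།")

def flag_possible_imperatives(segments):
    """Return indexes of segments where གྱིས་ directly precedes a shad."""
    return [i for i, seg in enumerate(segments)
            if GYIS_BEFORE_SHAD.search(seg)]
```

A flagged segment is only a prompt to look at the passage; an unflagged གྱིས་ in the middle of a clause is left alone as the ordinary instrumental particle.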

Inflected Verbs Followed by Other Case or Non-case Particles that Are Not Segmented by The Script:

The script does not break after an inflected verb followed by a case particle other than the ablative ནས་/ ལས་. It will be less common for a complete thought to end on such clauses, however, there will be some cases where it does, and you should inspect every inflected verb and add the break if the preceding clause can stand on its own.

This will happen frequently with the relational particle (ཀྱི་/གྱི་/གི་/-འི་/ཡི་) and the 2nd/4th/7th case particle (ལ་/སུ་/ན་/-ར་/དུ་/རུ་/ཏུ་), and sometimes, though less frequently, with the instrumental (ཀྱིས་/གྱིས་/གིས་/-ས་/ཡིས་). For example:

སེམས་ ཀྱི་ ངོ་བོ་ སྟོང་ ཡིན་ ལ་$
The essence of mind is empty, but

སེམས་ ཀྱི་ རང་བཞིན་ གསལ་ ཡིན །
The nature of mind is luminous.

བདག་ རྟག་པ་ མེད་པ་ ཡིན་ གྱིས་$
A permanent self is non-existent but

བདག་ཉིད་དེ་ སེམས་ འཁུལ་བར་ སྣང་།
That very self appears to mistaken mind

ཏིང་འཛིན་ ཞི་བ་ བསྒོམ་པ་ ཐོབ་ འགྱུར་ གྱི །_།$
You will attain the state of concentration, the cultivation of peace, and

For other non-case particles, breaks may similarly be applied, but in each case be sure that what precedes and follows the break can stand on its own. There should be no adverbs or modifiers shared between both clauses, nor should the preceding clause be the object/subject of the following one or vice versa. Some examples:

བཅོམ་ལྡན་འདས་ ཆུ་བོ་ ཆེན་པོ་ ཀླུང་ ནཻ་ རཉྫ་ ནཱའི་ འགྲམ་ ལ་ གཤེགས་ དང་
The Blessed One went to the banks of the great Nairañjanā River, and

དེར་རྒྱལ་པོ་ དང་ །_ བློན་པོ་ དང་ །_ བྲམ་ཟེ་ དང་ །_ ཁྱིམ་བདག་ ཐམས་ཅད་ ལ་ ཆོས་ བསྟེན་ ནོ །
There he taught the dharma to kings, ministers, brahmins, and householders.

ལྷ འི་ བུ་ དག་ གང་ སེམས་ཅན་ ལ་ལ ས་ སངས་རྒྱས་ བཅོམ་ལྡན་འདས་ མྱ་ངན་ ལས་ འདས་ ནས་* ལོ་ བརྒྱ་ སྟོང་ ངམ །_ བསྐལ་པ་ བྱེ་བ་ ལོན་ ཡང་$
Gods, even though one thousand years, an eon, or even ten million eons may have elapsed since the Bhagavān Buddha entered parinirvāṇa,

Note, here the segment before “འདས་ ནས་” was merged because the following verb “ལོན་” modifies it by indicating its location in time (thus the English translation “since”).

In general, try to apply breaks to non-case particles less often. Since these scenarios are often less clear, it’s best to avoid breaking on them if you have any doubts. But if it is sensibly clear that a break should be there, go ahead and add it. Also, if a segment is getting very long, in the 40+ syllable range, it may be good to see if any of these additional breaks may be applied in a valid way.
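To find segments in the 40+ syllable range without counting by hand, syllables can be counted by splitting on the tsheg and shad. The following Python sketch is one hedged way of doing so; the function names and the handling of bracketed folio references are assumptions, and the count is approximate for text containing non-Tibetan characters.

```python
import re

def count_syllables(segment: str) -> int:
    """Approximate the number of Tibetan syllables in a segment."""
    text = re.sub(r"\[[^\]]*\]", "", segment)  # drop folio refs like [146b]
    text = re.sub(r"[\s_]+", "", text)         # rejoin split tokens (པ ར་)
    # Split on tsheg (U+0F0B), non-breaking tsheg (U+0F0C), shad (U+0F0D).
    parts = re.split(r"[\u0F0B\u0F0C\u0F0D]+", text)
    return len([p for p in parts if p])

def long_segments(segments, limit=40):
    """Return the segments whose syllable count reaches the limit."""
    return [seg for seg in segments if count_syllables(seg) >= limit]
```

The same count could be used with a small lower limit to list very short segments, such as a two-syllable verb of thought, that may need to be merged.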

In general we should not place breaks after nominalized verbs (+ -པ་/བ་/པར་ etc...). However, there are two cases where a break after a nominalized verb makes sense. Consider the two following cases:

Verbs of Quoted Speech and Thought:

Usually verbs of speech such as “བཅོམ་ལྡན་འདས་ཀྱིས་བཀའ་སྩལ་པ། The Blessed One said” or “དེས་སྨྲས་པ། They replied” are stated before or after quoted speech and the verb is nominalized (“བཀའ་སྩལ་པ།” or “སྨྲས་པ།”). In these cases we actually should break these phrases off as their own segments, separated from their quoted speech. The same applies to verbs of quoted thought such as “སྙམ་པ་”. This is because they don’t need to be included in the quoted passage to be understood; they are often quite verbose and omitted in the English translation; and including them in the quoted speech would make those TMs longer and less likely to trigger fuzzy matches.

Therefore these statements of thinking or speech should be sectioned off into their own segments:

འདི་ལྟ་ སྟེ །_ བཅོམ་ལྡན་ ཀྱིས་ བཀའ་ སྩལ་པ །
For the Buddha has declared,

དེ་ ནས་ འཇམ་དཔལ་ གཞོན་ནུར་ གྱུར་ པས་ ལིད་ཙ་བཱི་ དྲི་ མ་ མེད་པར་ གྲགས་པ་ ལ་ སྨྲས་པ །
Thereupon, Mañjuśrī, the crown prince, addressed the Licchavi Vimalakīrti,

If the statement is entirely omitted in the English, as is often the case when such statements become redundant, then leave that segment blank in the English cell:

འཇམ་དཔལ་ གྱིས་སྨྲས་པ །

There are of course exceptions to this rule: if a modifier is inserted in the middle of the speech; if the clause of speech is exceedingly short; or if there is some other unexpected but sensible reason, then the verb should be included with the quoted speech:

བདག་ ནི་ དེར་ འགྲོའོ་ སྙམ་ མོ །
he thinks, ‘I shall go there.’

ཚིག་ ཏུ་ སྨྲས་ པས་ འདི་ ང་ ཡི་ དཔང་ ཡིན་ ཏེ །
And said, “This earth is my witness.

ཤེས་ལྡན་ དག་ ཨེ་མ་འདི་ དམག་ མང་པོ་ དང་ ལྡན་ནོ་ ཞེས་ སྨྲས་པས།
He said, “Gentlemen! It is awesome to behold!”

Long Descriptive Passages Using Nominalized Verbs

There are sometimes long passages of text containing a continuous chain of nominalized verbs giving a description of some place or object or describing a sequence of events. Generally we want to avoid breaking on nominalized verbs, but if these passages are particularly long (running for 40+ syllables), then they should be examined to see if parts of the passage can be separated according to clear themes. This is one exception where we should look to the English translation to help identify themes, as the translator has likely already identified any themes in a run-on phrase and broken them into sentences or with a semicolon. For example:

ས་གཞི་ ལག་མཐིལ་ ལྟར་ མཉམ་ ལ །_ ལྷ འི་ ཡིད་ དུ་ འོང་བ་ ཁ་དོག་ དང་ ལྡན་པ །_ དྲི་ དང་ ལྡན་པ །_
The ground became as smooth as the palm of a hand, divinely pleasing to the mind, colorful, and fragrant.

ལྷ འི་ མེ་ཏོག་ གི་ ཤིང་ དང་ །_ འབྲས་བུ འི་ ཤིང་ དང་ །_ སྤོས་ ཀྱི་ ཤིང་ དང་ །_ རིན་པོ་ཆེ འི་ ཤིང་ དང་ །_ དཔག་བསམ་ གྱི་ ཤིང་ དང་ །_ གོས་ ཀྱི་ ཤིང་ རྣམས་ ཀྱིས་ མཛེས་པ ར་ བྱས་པ །_
It was ornamented with heavenly flower trees, fruit trees, fragrant trees, jewel trees, wish-fulfilling trees, and trees bearing garments.

ལྷ འི་ སེང་གེ འི་ ཁྲི་ དང་ ལྡན་པ །_ རིན་པོ་ཆེ་ དང་ །_ དར་ དང་ །_ མེ་ཏོག་ གི་ ཆུན་པོ་ རབ་ ཏུ་ དཔྱངས་པ །_ ལྷ འི་ དྲིལ་བུ འི་ སྒྲ ས་ བརྒྱན་པ ར་ གྱུར་ ཏེ
It supported heavenly lion thrones with hanging garlands of jewels, cloth, and flowers, and was suffused with the sound of divine bells.

This long descriptive passage can be separated into three themes, “ground,” “trees,” and “throne,” although it can be debated whether the first two should go together. There should be a clearly distinguished theme for each segment that sensibly makes up a complete thought. You can break up large passages in this manner, but if you have any doubts whether the segments might be misunderstood on their own, then it is better to leave the passage as a larger segment.

iii Editing English Segmentation:

The following section explains all the guidelines regarding the English segments and how they should be matched to the Tibetan. Note that the script for the 84000 project will add several references to the English, including: folio references, [1.b], [2.a], etc.; milestone references, $1, $2, $3, etc.; and note references, #1, #2, #3, etc. (Other TM projects may wish to use similar notation, as these references may be converted into suitable markup in the finalized .tmx file.)

In general you can ignore these, except the note references will refer you to that endnote as it appears in the 84000 Reading Room, and it should be checked if you suspect there is an error or alternative source used in the passage (see instructions below).

Do keep in mind that the milestone references (marked with “$” signs) should never be placed at the end of a segment; if one seems to be placed there, it should instead come at the beginning of the following segment. A note reference (marked with the “#” sign), on the other hand, if it appears in between two segments, should always come at the end of the preceding segment, e.g.,

NOT like this:

leads them to the extinction of their suffering in the sphere of remainderless parinirvāṇa.#4 $30

“He liberates them from the eight unwholesome factors and sets each of them on the eightfold noble path.

like this:

leads them to the extinction of their suffering in the sphere of remainderless parinirvāṇa.#4

$30 “He liberates them from the eight unwholesome factors and sets each of them on the eightfold noble path.
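The milestone rule can also be applied mechanically. The following Python sketch moves a milestone reference found at the end of one English segment to the start of the next; it is a hypothetical helper, and it assumes that milestones always take the form of a "$" followed by digits and that at most one milestone trails each segment.

```python
import re

# A milestone reference such as $30 at the very end of a segment.
TRAILING_MILESTONE = re.compile(r"\s*(\$\d+)\s*$")

def fix_milestones(segments):
    """Move trailing milestone refs to the start of the next segment."""
    fixed = list(segments)
    # The last segment has no following segment, so it is left as-is.
    for i in range(len(fixed) - 1):
        match = TRAILING_MILESTONE.search(fixed[i])
        if match:
            fixed[i] = fixed[i][:match.start()]
            fixed[i + 1] = f"{match.group(1)} {fixed[i + 1]}"
    return fixed
```

Note references marked with "#" are left untouched, so a segment ending in "#4 $30" keeps the "#4" and passes only the "$30" forward.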

Changing the Sentence Order in the English

Since the segments themselves are the foundational units to be documented, the order in which they appear is not important. TM editors will therefore often need to reorder the English phrases or sentences; this is perfectly acceptable, and necessary in many cases, especially since the grammatical word order is so different in Tibetan and English. For example:

An English and Tibetan passage with three phrases A, B, and C could be arranged in the .tmx file as follows:

Tibetan	English
A	B
B	C
C	A

As described in the InterText demo, CTRL-X may be used to swap an English segment with the one above, although the “cross-align” feature needs to be enabled.

For example, consider the following two segments:

བསྐལ་པ་ བྱེ་བ་ ཕྲག་ བརྒྱ་ གྲངས་མེད་ དུ ། _ ། སྲིད་པ འི་ རྒྱ་མཚོ ར་ ངེས་པ ར་ འཁོར་བ་ དང་ ། _ ། ཉོན་མོངས་ དོག་པ ར་ རྟག་ ཏུ་ འཁོར་ ན ། _ །

འཚོ་བ་ མ་རུངས་པ་ ནི་ སྐྲག་ མི་ འགྱུར ། _ །

“Those of unwholesome livelihood have no fear,

Even though they are bound to circle In saṃsāra’s ocean for countless eons, Always ensnared by afflictions.

The English must be matched to the segments as follows:

བསྐལ་པ་ བྱེ་བ་ ཕྲག་ བརྒྱ་ གྲངས་མེད་ དུ ། _ ། སྲིད་པ འི་ རྒྱ་མཚོ ར་ ངེས་པ ར་ འཁོར་བ་ དང་ ། _ ། ཉོན་མོངས་ དོག་པ ར་ རྟག་ ཏུ་ འཁོར་ ན ། _ །
Even though they are bound to circle In saṃsāra’s ocean for countless eons, Always ensnared by afflictions.

འཚོ་བ་ མ་རུངས་པ་ ནི་ སྐྲག་ མི་ འགྱུར ། _ །
“Those of unwholesome livelihood have no fear,

The order of the English segments as they appear from InterText is no issue, so long as they are completely correlated to the Tibetan segments they are matched to.

Separating Compounded English Segments

As has been mentioned, sometimes the English translation compounds two Tibetan segments together and intermingles their words, which prevents you from making a clean break. Sometimes it makes sense to bend the rules a little or make a larger segment, as long as the resulting TM isn’t too long and can be clearly understood. However, if this can’t be done cleanly, there is another solution:

From InterText you should copy the full English passage in which the Tibetan segments have been intermingled.
Create a new translation unit in InterText (Edit → “Insert Element” or I), then paste a copy of the full passage into the new alignment box so that the same passage is matched with each of the Tibetan segments. Finally, add square brackets around any English words that are not found in the matched Tibetan segment.

For example, consider the following four Tibetan segments, which need to be linked to the English passage:

ཡོངས་སུ་ མྱ་ངན་ ལས་ འདའ་བ་ ཡང་ སྟོན །
སྐྱེ་བ་ ཡང་ སྟོན །
འཁོར་ལོ ས་ སྒྱུར་བ འི་ རྒྱལ་པོ་ ཡང་ སྟོན །
རྩེ་བ་ དང་ ། དགའ་བ་ དང་ ། བུད་མེད་ ཀྱི་ བཞད་གད་ དང་ ། ཀུ་རེ་ དང་ ། དྲི་ དང་ ། ཕྲེང་བ་ དང་ ། དགའ་ ཞིང་ རྩེ་བ་ ཡང་ སྟོན ། །

He manifests as being in the state of parinirvāṇa, as being born, as a universal monarch, and also as someone joyful who is entertained by amusements and pleasures such as women’s laughter, play, perfume, and garlands.

Since each of these Tibetan segments is a complete clause with the final inflected verb “སྟོན་”, the English should be duplicated and bracketed when matched to each Tibetan segment in the following manner:

ཡོངས་སུ་ མྱ་ངན་ ལས་ འདའ་བ་ ཡང་ སྟོན །_
He manifests as being in the state of parinirvāṇa,

སྐྱེ་བ་ ཡང་ སྟོན །_
He manifests [as being in the state of parinirvāṇa,] as being born,

འཁོར་ལོ ས་ སྒྱུར་བ འི་ རྒྱལ་པོ་ ཡང་ སྟོན །_
He manifests [as being in the state of parinirvāṇa, as being born,] as a universal monarch,

རྩེ་བ་ དང་ །_ དགའ་བ་ དང་ །_ བུད་མེད་ ཀྱི་ བཞད་གད་ དང་ །_ ཀུ་རེ་ དང་ །_ དྲི་ དང་ །_ ཕྲེང་བ་ དང་ །_ དགའ་ ཞིང་ རྩེ་བ་ ཡང་ སྟོན །_།
He manifests [as being in the state of parinirvāṇa, as being born, as a universal monarch, and also] as someone joyful who is entertained by amusements and pleasures such as women’s laughter, play, perfume, and garlands.

You can see this process demonstrated in the InterText tutorial from minute 17:18.

Note that the first three segments do not require the full English passage; it is sufficient to end the English passage once all the Tibetan words have been matched. The last segment, however, requires the full passage in order to state the verb “manifests” (“སྟོན་”) as it is stated in the Tibetan.

With this format, the next translator using the TMs will be able to discern that the bracketed text is not contained in the matched Tibetan but in another segment. The additional text will not cause confusion and may even provide useful context.
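As a rough illustration of how the bracket convention can be read back later, the following Python sketch removes the bracketed text from an English segment, leaving only the words matched to the linked Tibetan. The function name strip_bracketed is hypothetical and is not part of InterText or any 84000 script; it also assumes brackets are never nested.

```python
import re


def strip_bracketed(english: str) -> str:
    """Remove [bracketed] spans and tidy the leftover whitespace.

    Assumption: brackets are never nested inside one another.
    """
    without_brackets = re.sub(r"\[[^\]]*\]", "", english)
    # Collapse the double spaces left behind where a bracketed span was removed.
    return re.sub(r"\s+", " ", without_brackets).strip()


segment = "He manifests [as being in the state of parinirvāṇa,] as being born,"
print(strip_bracketed(segment))
```

The bracketed text itself is left untouched in the TM; a sketch like this would only be used downstream, for example to compare segment lengths.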

As a side note, in general you will notice that the English translation will often exchange pronouns for proper names and vice versa, state the subject when it is only implied in the Tibetan, or omit the subject when it can be implied in the English. You need not mind such scenarios, as they are common and easily understood from the context of the text. There is no need to bracket or alter the English in such cases. The technique described above is really only necessary when you need to, in a sense, pass through one segment to reach another in the English.

Under no circumstances should you retranslate the English. For this technique, you should only copy and paste the full English passage and bracket any text that is not included in the linked Tibetan. If you believe there are actual errors in the translation, please follow the guidelines for this case in the next topic below.

Punctuation:

Include all of the English punctuation and quote marks as is. They should be fit into the segments as is sensible, and it is no problem if an opening quote mark is contained in a segment without a matching closing quote mark.

Conjunctions “and” and “but” translated for “ནས་” “ལས་” and similar particles:

As the conjunctive particles that govern Tibetan clauses come attached immediately after the final inflected verb, and since we are segmenting from the perspective of the Tibetan grammar, when such a particle is translated in English as a conjunction such as “and”, “but”, “then”, etc., the conjunction should be kept with that clause’s English segment, even if it seems to be left hanging at the end of the clause from the perspective of the English:

ངན་འགྲོ་ ཐམས་ཅད་ ལས་ ཡོངས་སུ་ ཐར་བ ར་ མཛད་ ནས་
He liberated them from all the negative migrations, and

སེམས་ ཀྱི་ ངོ་བོ་ སྟོང་ ཡིན་ ལ་
The essence of mind is empty, but

Verses

The formatting of lines into verses should not influence your choices for segmenting the Tibetan, nor will it be affected by the script. So we aren’t directly taking the structure of the verses into account as a cue for our segmentation, although you may very well find that they will tend to naturally match up with the clauses and segments.

Note this is contrary to the older version of the TM guidelines.

Words or Phrases Omitted or Added within the English Translation:

We would like to keep a comprehensive and complete bitext of the translation in our record. Therefore all the text for both the Tibetan and English needs to be complete in the .xml alignments file. As mentioned before, for phrases that are omitted in the English, it is permissible to leave empty entries in the Tibetan or English, e.g.:

འཇམ་དཔལ་ གྱིས་སྨྲས་པ །
[Here the cell in the English column should be intentionally left blank]

Presumably the English translator left out “Mañjuśrī said,” because they deemed it redundant or unnecessary in the English reading.

Generally added phrases or elaborations found in the English should be included if that elaboration corresponds to a matching Tibetan segment. However, there may be rare cases where there is a heading or something similar that has no matching Tibetan, and in such a case the English may be matched to an empty Tibetan entry.

Although all these segments with empty entries will be preserved in the record of .xml alignment, by default they will be removed by InterText in the exported .tmx record, which shouldn’t be a problem since these empty TMs aren’t actually useful for translators working on CAT platforms.
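The effect of this export behavior can be pictured with a small Python sketch. This is only a model of the behavior described above, not InterText's actual export code; the function name exportable_pairs and the list-of-tuples representation are assumptions made for illustration.

```python
def exportable_pairs(pairs):
    """Keep only the pairs in which both the Tibetan and the English side have text."""
    return [(bo, en) for bo, en in pairs if bo.strip() and en.strip()]


# Each tuple is (Tibetan segment, English segment), as recorded in the alignment.
pairs = [
    ("འཇམ་དཔལ་ གྱིས་སྨྲས་པ །", ""),  # English intentionally left blank
    ("སེམས་ ཀྱི་ ངོ་བོ་ སྟོང་ ཡིན་ ལ་", "The essence of mind is empty, but"),
]

for bo, en in exportable_pairs(pairs):
    print(bo, "|", en)
```

Only the second pair would survive into the exported record, while the full list, including the empty entry, corresponds to what is preserved in the .xml alignment.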

iv Flagging Problematic Segments

In addition to segmenting the TMs you should also be checking for and flagging any potentially problematic segments. This includes erroneous translations or translations that were made from an alternate source. The flagging system we use is as follows:

Marking Errors:

While editing the TM segments, you may occasionally come across seeming translation errors in the published texts. 84000 publications go through a rigorous editorial review, but occasionally there may be some translation errors, typos, or a section of text that was deleted by mistake. Since you are closely reviewing the Tibetan-English correlation, you are in an advantaged position for noticing errors that may have slipped through into the final publication.

My hope is that the translations are well polished and this won't be necessary, but if you do believe you have found an error please first check:

Is this feasibly a stylistic choice made by the translator, in which the English translation doesn't follow the literal wording or grammar found in the Tibetan source but does represent a sensible understanding of the meaning?
Is there an end note in this passage found in the Reading Room that explains an alternate source or other reasoning for omitting or changing a passage? In general you should keep a window open on the notes section of the published text in the Reading Room in order to reference the end notes if something seems off. All the notes will be marked in the English following a hashtag “#” sign, so you will be able to easily check the corresponding note from the Reading Room page.
Is this a redundant phrase, such as a speech attribution (e.g. "The Blessed One said,"), that could sensibly be omitted from the English for better readability because it is obvious from context?

If the answer to these three questions is no, then this is likely an error that should be addressed. First enter it into the revision sheet, and then type a single % character at the beginning of the English segment. This creates the flag that indicates there is an error to be reviewed:

%It dispels all obscurity and drkness,
མུན་ནག་ ཐིབས་པོ་ ཐམས་ཅད་ ཀྱང་ རྣམ་པར་ སེལ །_

“darkness” is misspelled. This can be used to flag common typos as well. Please don’t correct the typo because it will need to be corrected in the publication as well, and we need to be able to see it specifically in order to do so.

If the error is obvious, then adding the % flag should be sufficient; it will be reviewed, but you can also record the error in this google sheet if it seems to need a specific explanation. If you are uncertain whether it is an error or not, you can err on the side of caution and flag it for review.

Note that the editorial review process is slow, so it may take a while to correct the passage in the publication, but you will be making a beneficial contribution to the final publication of the Buddhavacana.

Again here is the link to the corrections spreadsheet.

Alternative Sources:

Sometimes there will be a sentence or phrase where the translator has used a source other than the Tibetan, such as Sanskrit, Chinese, or a Kangyur edition different from the Derge. Such instances should be declared in the translator’s introduction and notes. Please use the endnote references “#” to check the notes in the Reading Room for any declared alternate sources. If you find an alternative source, add a ! character to the beginning of the Tibetan segment, and it will be reviewed when finalizing the text. Even though the TM will be matched to the Derge eKangyur, it may still be a useful TM; however, it definitely needs to be flagged if it is translated from an alternate source. Please match the segment as best you can; it is fine if all the words don’t match, as long as it is flagged. For example:

!བག་མེད་པ་ ཡི་ དབང་ དུ་མ་ འགྲོ་ ཞིག _།
Do not fall under the power of delusion!#3

Here note 3 in the Reading Room says: Translated from the Cone edition གཏི་མུག་. Derge reads བག་མེད་པ་ (“carelessness”).

The Tibetan is flagged with a ! character.

Note that 84000 publications use the Vienna sigla conventions for referring to the different versions of the Kangyur, in which the Derge is indicated by a “D”.
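To make the two flags concrete, here is a minimal Python sketch of how a reviewer's script might detect them on a segment pair. The function name read_flags and the returned dictionary keys are hypothetical; the only details taken from these guidelines are that % is typed at the start of the English segment and ! at the start of the Tibetan segment.

```python
def read_flags(tibetan: str, english: str) -> dict:
    """Report which review flags are present on a Tibetan-English segment pair."""
    return {
        # "!" at the start of the Tibetan marks an alternate source.
        "alternate_source": tibetan.lstrip().startswith("!"),
        # "%" at the start of the English marks a suspected error or typo.
        "error": english.lstrip().startswith("%"),
    }


print(read_flags("!བག་མེད་པ་ ཡི་ དབང་ དུ་མ་ འགྲོ་ ཞིག _།",
                 "Do not fall under the power of delusion!#3"))
print(read_flags("མུན་ནག་ ཐིབས་པོ་ ཐམས་ཅད་ ཀྱང་ རྣམ་པར་ སེལ །_",
                 "%It dispels all obscurity and drkness,"))
```

Note that the exclamation mark inside the English sentence of the first example does not count as a flag, since only the first character of the Tibetan segment is checked.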

v Judgment Calls

This concludes our guidelines for creating TMs. The purpose of making them so precise is that following a standard will optimize the TMs’ value and usefulness when used from CAT platforms. Having the pybo-catscript makes it very convenient for translators to pre-segment the text in an efficient way, and then they will get the most out of the TM resources.

Following these standards will make the TMs fairly consistent, but we also realize a degree of subjectivity is inevitable, so it is fine to use your own judgement and common sense. The TM should essentially be the fundamental unit or block that translators will use to process their work. It is good to imagine how you would reference the TMs in your own work. If you want some more samples for how to edit the segments, please see the following examples from Toh 186, The Teaching on the Extraordinary Transformation that is the Miracle of Attaining the Buddha’s Powers in this google sheet:

We would also like to hear from you if you have any feedback or suggestions for how you might standardize TMs or set up the script in your own work. If so please contact Celso at celso.wilkinson@gmail.com

vi Aligning Texts That Were Translated From Alternate Sources (Sanskrit or Other)

Occasionally some 84000 texts are translated primarily from a Sanskrit source rather than the Tibetan found in the Degé Kangyur. The translator will state the source they used in the introduction, but I will alert you if this is the case for any TM assignment (otherwise you can disregard this section).

At some point we may begin creating Sanskrit-English alignments, but in the meantime, although the Sanskrit source may differ, we have found that it is still worthwhile to create a Tibetan-English alignment, as these alignments may provide valuable clues for difficult passages. Note that as a TM editor, you do not need to actually know Sanskrit to work on these alternate source texts; however, there are some special instructions for creating these alignments that differ a bit from the regular guidelines described above:

Flags

For these texts with an alternate source we can presume that the "!" alternate source flag will be automatically applied to all the segments. Therefore, you do not need to worry about adding this flag, as it will be added by default when the alignment files are finalized. However, you will likely see a lot of Tibetan or English text that has no corresponding source or target.

I've done the TMs for texts like this before, and since the Tibetans took such a literal approach to their translations back when they were written, the Kangyur often aligns with the English translation quite well despite this incongruity with the source. I think you will see that sometimes the text lines up quite nicely, but sometimes it is totally off or omitted altogether. So basically your job as the aligner is to line up the segments that do align or closely align (in the 90-100% range) and for everything else just leave that cell, whether Tibetan or English, next to an empty cell in that corresponding row. This causes the segment to be omitted in the final exported TMX file, but keeps the alignment recorded in our XML file.

So in some ways this process is easier because you don't need to worry about checking the notes or making any flags, but it does require you to use your own intuitive judgement. You should be thinking of the use of the final TMX file from the perspective of a translator recalling the TMs while working on a new project. When the TM is recalled it will include the default flag indicating that the TM is not a direct correlation, and the translator will be aware of this. However, even if the TM isn't a direct match, it can still be quite a useful piece of information and shed some light on a difficult passage. That is why in this case I'm giving you an approximate (90-100%) leeway to create the match, since the criterion here isn't necessarily that the TM is a 100% match, but rather that it is a useful reference.

Likewise the "%" flag for dubious English should be overridden in the case of these texts since they will often not match up exactly. However, there is one exception: Please do flag any instances of typos or blatant grammatical errors you happen to notice in the English, as this will notify me to make necessary corrections to the publication.

vii Cheat Sheet

A summary of these segmentation standards is outlined in this cheat sheet here.

viii A Note About Backing Up Alignment Files

When you create and begin editing an alignment on InterText, InterText doesn't edit the imported text files but creates three .xml files in a separate repository on your computer. InterText will save the files in this repository and you can open them from the menu "Alignment -> Open from repository". If you are working on a large text I recommend that you back up these files occasionally in case your computer crashes or a similar issue occurs. The simplest way to do this is from the InterText menu "Alignment -> export". This will export 3 .xml files into a location you select. These three .xml files are essentially a snapshot of your current alignment and can be used to recover your work. To recover an alignment in this way, you will need to place all three files in the same directory and then from the InterText menu "Alignment -> import" navigate to the location of the 3 files and select the one that aligns the other two--in this case it will be the one with the ending "-bo.en.xml". It then automatically imports the alignment into the repository from the source and target contained in the two other .xml files.

These three .xml files are small data files, so they can easily be saved by emailing them to yourself or saving them in a dropbox folder etc.

Important

I recommend that you save a backup of your current assignment before updating your operating system. It has been reported that updating Mac OS has in some cases caused the InterText repository to be lost. However, if this happens please check with your manager; it may be the case that the repository is still located somewhere on your computer but has simply moved to a location that InterText is not able to find.

InterText Updates

InterText was not developed by 84000 but is an open source app created by a third party. Occasionally when you open InterText, it may prompt you to make an update. You should know that the current version of InterText supports all of the needs of our 84000 TM project, so it isn't necessary to update. But if you do, you might make a backup of your files just in case. In one case, an editor reported that updating moved the repository and they needed to relocate it on their computer, after which it was preserved.

Automatic Backups

If you are concerned about losing work, you could set up another app for creating automatic backups of the repository; it would just be a matter of locating the repository on your computer. The location of this directory is a bit odd. On PC it is located in User/AppData/Local/InterText (you will need to display "hidden" items to locate the "AppData" folder). The location should be similar on Mac OS (if you find it on your Mac please report it to me so I can add the location to these guidelines). You can also try searching your computer for the name of your alignment followed by ".bo.en.xml" e.g. "toh999.bo.en.xml".
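For anyone comfortable running a script, an automatic backup could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the function name backup_alignment is hypothetical, the repository path must be supplied by you, and the script assumes that all the .xml files belonging to an alignment begin with the alignment name (only the ".bo.en.xml" ending is confirmed above).

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_alignment(repository: Path, name: str, backup_root: Path) -> Path:
    """Copy the .xml files whose names start with `name` into a new timestamped folder."""
    target = backup_root / f"{name}-{datetime.now():%Y%m%d-%H%M%S}"
    target.mkdir(parents=True, exist_ok=True)
    # Assumption: every file of the alignment shares the alignment name as a prefix.
    for source in repository.glob(f"{name}*.xml"):
        shutil.copy2(source, target / source.name)
    return target
```

It might be called as backup_alignment(Path.home() / "AppData" / "Local" / "InterText", "toh999", Path.home() / "InterTextBackups"), where both folder locations are placeholders to adjust for your own computer. Each run creates a separate folder, so earlier snapshots are never overwritten.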
