# Parsing programs

_Hannes Vlaardingerbroek, ETCBC_

hannes@vlaardingerbroek.nl

Data from many sources

- Different file formats, transcriptions, etc.

- Inside ETCBC: PIL files, AT/AN files

- Outside ETCBC: Word documents, compact text databases (SEDRA III), less compact databases

Transcriptions

![tr_table](imgs2/tr_table.png)

ETCBC Parsing programs
 
 - ETCBC has internal programs for parsing the data

 - To use the data in Python, we need to parse the output of those programs

 - Or, alternatively, write the parsers in Python

- PIL parser
  - generates the running text of the Peshitta edition by removing variants and comments (input first, then output)

```
@Ex2
 1 w'z#l gbr' mn dbyt lwy^ wns#b lbrt lwy=.;
 2 wbTn#t 'ntt' wyldt br'=. wHzth d$pyr h#w=. wT$yth tlt' yr"Hyn=.;
 3 wl' '$k#Ht twb lmT$ywth=. wnsbt lh^ qbwt' [dqys'/+9b1, 9l2, 10j1, 11l2,
   12a1fam, 12b1, 12b2, L, M, U] d`rq'=. w$`th^ bkwpr' wbzpt'=. wsm#t bh^
   Tly' [l-/+12a1, L, M, U] =.  wsmth^ brqq' `l spth dnhr'=.;
```

```
@Ex2
 1 w'z#l gbr' mn dbyt lwy^ wns#b lbrt lwy=.;
 2 wbTn#t 'ntt' wyldt br'=. wHzth d$pyr h#w=. wT$yth tlt' yr"Hyn=.;
 3 wl' '$k#Ht twb lmT$ywth=. wnsbt lh^ qbwt' d`rq'=. w$`th^ bkwpr' wbzpt'=. wsm#t bh^
   Tly' =.  wsmth^ brqq' `l spth dnhr'=.;
```
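A minimal sketch of that removal step, not the actual ETCBC PIL parser. It assumes that every variant or comment is a bracketed span `[...]`, possibly running over several lines, followed by a single space. The names `BRACKETED` and `running_text` are hypothetical.

```python
import re

# Assumption: variants and comments are always enclosed in square brackets,
# may span line breaks, and are followed by one space that should go as well.
BRACKETED = re.compile(r"\[[^\]]*\] ?")


def running_text(pil_text: str) -> str:
    """Return the PIL text with all bracketed spans removed."""
    return BRACKETED.sub("", pil_text)


# Verse 3 of the Ex2 example above, including its two variant notes.
sample = (
    " 3 wl' '$k#Ht twb lmT$ywth=. wnsbt lh^ qbwt' [dqys'/+9b1, 9l2, 10j1, 11l2,\n"
    "   12a1fam, 12b1, 12b2, L, M, U] d`rq'=. w$`th^ bkwpr' wbzpt'=. wsm#t bh^\n"
    "   Tly' [l-/+12a1, L, M, U] =.  wsmth^ brqq' `l spth dnhr'=.;\n"
)

print(running_text(sample))
```

Because the negated character class `[^\]]` also matches newlines, a note that wraps onto the next line is removed in one piece, which is why the first two input lines merge in the output.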

- AN parser
  - to use morphologically encoded ETCBC texts as input for the morphological analyzer

```
  0,1 TWB                       TWB
  0,1 KTB>                      KTB=/~>
  0,1 DNM"WS>                   D-NMWS/(J~>
  0,1 D>TR"WT>                  D->TR/&WT=~>
  1,1 MN                        MN
  1,1 QDM                       QDM
  1,1 JWM"T>                    JWM/T=~>
  1,1 <LJN                      <L=[/JN
  1,1 HWJN                      HWJ[N
  1,1 LMS<R                     L-!M!S<R=[/
  1,1 LCMCGRM                   L-CMCGRM/
  1,1 >XWN                      >X/&W-N
  1,1 W>T>                      W->T(J&>[
  1,1 >CKXN                     ]>]CKX[-N
  1,1 TMN                       TMN
  1,1 BRDJYN                    BRDJYN/
```

```
0,1	D>TR"WT>	D->TR/&WT=~>
    D
	morphemes: (('lex', 'D'),)
	functions: (('nu', False), ('gn', False), ('st', False), ('vt', False), ('vs', False), ('ps', False), ('sp', 'prep'), ('ls', 'pcon'))
	lex      : ('7789', (('sp', 'prep'), ('ls', 'pcon'), ('gl', '(relative)')))
    >TR/&WT=~>
```
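The first step of reading such a file can be sketched as follows. This is an illustration only: it assumes each AN line holds three whitespace-separated fields (reference, surface form, encoded form) and that `-` separates the segments of the encoded form, as the `D` / `>TR/&WT=~>` split above suggests. The names `AnLine` and `parse_an_line` are hypothetical, and the lookup of morphemes, functions and lexicon entries done by the real parser is not reproduced.

```python
from typing import NamedTuple, Tuple


class AnLine(NamedTuple):
    """One line of an AN file (field names are illustrative assumptions)."""
    ref: str                   # e.g. "0,1"
    surface: str               # surface form, e.g. D>TR"WT>
    encoded: str               # morphologically encoded form
    segments: Tuple[str, ...]  # encoded form split on "-"


def parse_an_line(line: str) -> AnLine:
    """Split an AN line into its three fields and segment the encoding."""
    ref, surface, encoded = line.split()
    # Assumption: "-" is the only segment separator in the encoded form.
    return AnLine(ref, surface, encoded, tuple(encoded.split("-")))


parsed = parse_an_line('  0,1 D>TR"WT>                  D->TR/&WT=~>')
print(parsed.ref, parsed.segments)
```

Using `str.split()` without arguments handles both the space-aligned and the tab-separated variants of the line shown above.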

- SyrNT (parsing NT file shipped with Syromorph)
- SedraIII (parsing relational DB SEDRA III)

```
import syrnt, sedra
nt1 = sedra.BFBS(sedra.tosyr).words()
nt2 = syrnt.SyrNT(syrnt.tosyr)

w = next(nt2)
w.annotation
```

```
Annotation(stem='ܟܬܒܐ', lexeme='ܟܬܒܐ', root='ܟܬܒ', prefix='', suffix='', seyame=0, verbal_conjugation=0, aspect=0, state=3, number=1, person=0, gender=2, pronoun_type=0, demonstrative_category=0, noun_type=2, numeral_type=0, participle_type=0, grammatical_category=2, suffix_contraction=0, suffix_gender=0, suffix_person=0, suffix_number=0, feminine_he_dot=0)
```

## MorphAn

### Morphological analyzer based on transformations

Purpose: to recognize morphological patterns and predict the probability of each possible tagging

Training text:

The Syriac New Testament, with annotations containing tags (lexeme, part of speech, gender/number/person, etc.)

Calculate transformation patterns between text form and lexeme form
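One way such a transformation pattern could be computed is sketched below. This is not MorphAn's actual algorithm, which is not shown here. The sketch assumes that a pattern is whatever remains of the text form and the lexeme on either side of their longest common substring, and that relative frequencies in the training data can stand in for probabilities. The function `transformation` and the word pairs are invented toy examples, not corpus data.

```python
from collections import Counter
from difflib import SequenceMatcher


def transformation(form: str, lexeme: str) -> tuple:
    """Describe how `form` differs from `lexeme` around their common core.

    Returns (form_prefix, form_suffix, lexeme_prefix, lexeme_suffix).
    """
    matcher = SequenceMatcher(None, form, lexeme, autojunk=False)
    m = matcher.find_longest_match(0, len(form), 0, len(lexeme))
    return (
        form[:m.a], form[m.a + m.size:],
        lexeme[:m.b], lexeme[m.b + m.size:],
    )


# Invented (text form, lexeme) pairs, for illustration only.
pairs = [("wktb", "ktb"), ("wqm", "qm"), ("ktbt", "ktb")]

counts = Counter(transformation(form, lexeme) for form, lexeme in pairs)
total = sum(counts.values())
for pattern, n in counts.most_common():
    print(pattern, n / total)
```

Counting patterns rather than whole words is what lets the approach generalize: a prefix pattern seen on one lexeme in training can be proposed for an unseen word with the same prefix.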