# Reading Source Data

Source data (Hebrew or Greek) is tokenized and stored as TSV: each row is one token, and the columns hold its attributes.

See [the index](00Index.ipynb) for the requirements to run this notebook.

## Contents

* [Source Token Attributes](#Source-Token-Attributes)
* [Corpus Properties](#Corpus-Properties)

## Source Token Attributes

Here's sample data for the first five words from Mark 1:1:

| id | altId | text | strongs | gloss | gloss2 | lemma | pos | morph |
| -- | ----- | ---- | ------- | ----- | ------ | ----- | --- | ----- |
| n41001001001 | Ἀρχὴ-1 | Ἀρχὴ | G0746 | [The] beginning | beginning | ἀρχή | noun | N-NSF |
| n41001001002 | τοῦ-1 | τοῦ | G3588 | of the | the | ὁ | det | T-GSN |
| n41001001003 | εὐαγγελίου-1 | εὐαγγελίου | G2098 | gospel | gospel | εὐαγγέλιον | noun | N-GSN |
| n41001001004 | Ἰησοῦ-1 | Ἰησοῦ | G2424 | of Jesus | Jesus | Ἰησοῦς | noun | N-GSM |
| n41001001005 | χριστοῦ-1 | χριστοῦ | G5547 | Christ | Christ | Χριστός | noun | N-GSM |
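
As a rough illustration of the layout, rows like those above can be read with Python's standard `csv` module. This is a minimal sketch that assumes tab-separated columns with a header row matching the table; the in-memory sample stands in for a real file, and `SourceReader` may well parse the data differently.

```python
import csv
import io

# Two rows from the table above, tab-separated with a header row (assumed layout).
SAMPLE_TSV = (
    "id\taltId\ttext\tstrongs\tgloss\tgloss2\tlemma\tpos\tmorph\n"
    "n41001001001\tἈρχὴ-1\tἈρχὴ\tG0746\t[The] beginning\tbeginning\tἀρχή\tnoun\tN-NSF\n"
    "n41001001002\tτοῦ-1\tτοῦ\tG3588\tof the\tthe\tὁ\tdet\tT-GSN\n"
)

# DictReader maps each row to a dict keyed by the header names.
reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
# Index the rows by token id so a token can be looked up directly.
tokens = {row["id"]: row for row in reader}

print(tokens["n41001001002"]["lemma"])
```

Keying the rows by `id` keeps lookups simple, since each identifier is unique within the corpus.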

Selected attribute documentation:
* The `id` attribute uniquely identifies this token in the corpus. 
    * The "n" prefix identifies it as a New Testament token, for consistency with Macula.
    * The digits follow the format BBCCCVVVWWW: book, chapter, verse, and word. Hebrew corpora add a word part identifier (so BBCCCVVVWWWP). The `biblelib` library has utilities for working with this format (`biblelib.word.bcvwpid`).
* The `text` attribute represents the surface text. Note that source corpora do not include punctuation in the surface text. 
* The `gloss` attribute provides English glosses for the text, typically with some contextual information (for example, `of Jesus` rather than `Jesus`).
* The `lemma` attribute represents the dictionary form of the word. This can be joined to lexicon data, depending on the lexicon format. 
* The `pos` attribute represents part of speech. 
* The `morph` attribute represents morphological information.

More details on the values for these attributes can be found in the Source Corpora Documentation under `explanation`. 
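
The identifier format described above can be unpacked with fixed-width slicing. The sketch below is only an illustration: `TokenId` and `parse_token_id` are hypothetical names, not part of `biblelib`, and the code assumes that any leading letter is a corpus prefix and that a twelfth digit is the word part.

```python
from typing import NamedTuple, Optional


class TokenId(NamedTuple):
    """Fields of a BBCCCVVVWWW(P) identifier (hypothetical helper, not biblelib)."""

    prefix: str
    book: str
    chapter: int
    verse: int
    word: int
    part: Optional[int]


def parse_token_id(token_id: str) -> TokenId:
    """Split an identifier into its fields by fixed-width slicing."""
    # A leading letter such as "n" is treated as the corpus prefix (assumption).
    prefix = token_id[:1] if token_id[:1].isalpha() else ""
    digits = token_id[len(prefix):]
    if len(digits) not in (11, 12) or not digits.isdigit():
        raise ValueError(f"unexpected identifier: {token_id!r}")
    # A twelfth digit is read as the word part used in Hebrew corpora.
    part = int(digits[11]) if len(digits) == 12 else None
    return TokenId(
        prefix=prefix,
        book=digits[0:2],
        chapter=int(digits[2:5]),
        verse=int(digits[5:8]),
        word=int(digits[8:11]),
        part=part,
    )


print(parse_token_id("n41001001004"))
```

The book field stays a string so that its leading zero is preserved; the other fields are converted to integers for easier comparison. For real work, the `biblelib` utilities are the better choice.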


In [None]:
# setup
from bible_alignments import SOURCES
from bible_alignments.burrito import SourceReader

# read the SBLGNT data
sblgnt = SourceReader(SOURCES / "SBLGNT.tsv")
# sblgnt is a dictionary mapping token identifiers to Source instances
sblgnt["n41001001004"]

In [None]:
# the fifth word of Mark 1:1
mrk1_1_5 = sblgnt["n41001001005"]
print("Basic attributes for Mark 1:1.5:")
print(f"identifier:\t{mrk1_1_5.id}")
# book/chapter/verse portion of identifier
print(f"bcv:\t\t{mrk1_1_5.bcv}")
print(f"text:\t\t{mrk1_1_5.text}")
# tuple of id and text
print(f"idtext:\t\t{mrk1_1_5.idtext}")
print(f"gloss:\t\t{mrk1_1_5.gloss}")
print(f"lemma:\t\t{mrk1_1_5.lemma}")
print(f"pos:\t\t{mrk1_1_5.pos}")
print(f"morph:\t\t{mrk1_1_5.morph}")
print()
print("Properties and methods:")
print(f"is_content():\t{mrk1_1_5.is_content()}")
print(f"is_noun():\t{mrk1_1_5.is_noun()}")
print(f"_is_pos('verb'): {mrk1_1_5._is_pos('verb')}")
print(f"_is_pos('noun'): {mrk1_1_5._is_pos('noun')}")
print(f"_display:\t{mrk1_1_5._display}")
print(f"asdict():\t{mrk1_1_5.asdict()}")

## Corpus Properties

You can aggregate the values of different attributes for tokens in the corpus with the `vocabulary()` method of the `SourceReader` class. 
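
To make the idea concrete, here is a minimal sketch of this kind of aggregation over plain dictionaries. It is not the `SourceReader.vocabulary()` implementation: the hypothetical `attribute_values` helper simply collects the distinct values of one attribute and returns them sorted (the ordering is an assumption). The sample rows come from the Mark 1:1 table earlier in this notebook.

```python
# Sample tokens from Mark 1:1, reduced to a few attributes.
sample_tokens = [
    {"text": "Ἀρχὴ", "lemma": "ἀρχή", "pos": "noun"},
    {"text": "τοῦ", "lemma": "ὁ", "pos": "det"},
    {"text": "εὐαγγελίου", "lemma": "εὐαγγέλιον", "pos": "noun"},
    {"text": "Ἰησοῦ", "lemma": "Ἰησοῦς", "pos": "noun"},
    {"text": "χριστοῦ", "lemma": "Χριστός", "pos": "noun"},
]


def attribute_values(tokens, tokenattr):
    """Return the sorted distinct values of one attribute (hypothetical helper)."""
    return sorted({token[tokenattr] for token in tokens})


# Distinct part-of-speech values in the sample.
print(attribute_values(sample_tokens, tokenattr="pos"))
# Number of distinct lemmas in the sample.
print(len(attribute_values(sample_tokens, tokenattr="lemma")))
```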

In [None]:
# the total number of tokens
len(sblgnt)

In [None]:
# the size of the token vocabulary
len(sblgnt.vocabulary())

In [None]:
# the size of the lemma vocabulary
lemmavocab = sblgnt.vocabulary(tokenattr="lemma")
print(f"Number of lemmas: {len(lemmavocab)}")
# the first 10, as examples. Note that the lemmas are case-sensitive and accented. 
lemmavocab[:10]

In [None]:
# the values for part of speech: not so many, so we can just list them. 
sblgnt.vocabulary(tokenattr="pos")