# Reading Target Data

The target translation is tokenized and stored as TSV, one token per row, with additional columns for other token attributes.

See [the index](00Index.ipynb) for the requirements to run this notebook.

## Contents

* [Target Token Attributes](#Target-Token-Attributes)
* [Corpus Properties](#Corpus-Properties)

## Target Token Attributes

Here's sample data for Mark 1:1 in the Berean Standard Bible:

| id | source_verse | text | skip_space_after | exclude | id_range_end | source_verse_range_end |
| -- | ------------ | ---- | ---------------- | ------- | ------------ | ---------------------- |
| 41001001001 | 41001001 | This |  |  |  |  |
| 41001001002 | 41001001 | is |  |  |  |  |
| 41001001003 | 41001001 | the |  |  |  |  |
| 41001001004 | 41001001 | beginning |  |  |  |  |
| 41001001005 | 41001001 | of |  |  |  |  |
| 41001001006 | 41001001 | the |  |  |  |  |
| 41001001007 | 41001001 | gospel |  |  |  |  |
| 41001001008 | 41001001 | of |  |  |  |  |
| 41001001009 | 41001001 | Jesus |  |  |  |  |
| 41001001010 | 41001001 | Christ | y |  |  |  |
| 41001001011 | 41001001 | , |  | y |  |  |
| 41001001012 | 41001001 | the |  |  |  |  |
| 41001001013 | 41001001 | Son |  |  |  |  |
| 41001001014 | 41001001 | of |  |  |  |  |
| 41001001015 | 41001001 | God | y |  |  |  |
| 41001001016 | 41001001 | . |  | y |  |  |

Selected attribute documentation:
* The `id` attribute uniquely identifies this token in the corpus. The `biblelib` library has utilities for working with this format (`biblelib.word.bcvwpid`). 
* The `source_verse` attribute identifies the corresponding book, chapter, and verse in the source text. This usually matches the book, chapter, and verse portion of the token's own `id`, but there are many cases where versification differs between target and source, and even cases where some words are moved across verse boundaries.
* The `text` attribute represents the surface text. Punctuation is also included, and is normally separated as its own token. 
* The `skip_space_after` attribute allows reconstructing a readable text from the sequence of tokens. The default value is false (left unmarked), which assumes space-delimited tokens. A `y` value indicates that no space should be added after this token (for example, after a word that is immediately followed by a comma or period in English).
* The `exclude` attribute indicates tokens that should not be aligned (for example, punctuation). 

More details on the values for these attributes can be found in the Target Corpora Documentation under `explanation`. 
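
To illustrate how `skip_space_after` supports rebuilding readable text, here is a minimal sketch in plain Python. The `Token` tuple and the `detokenize` function are hypothetical stand-ins written for this example; they are not part of `bible_alignments`, and the library may reconstruct text differently.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for one TSV row (not the library's Target class)."""

    text: str
    skip_space_after: bool = False


def detokenize(tokens):
    """Join token texts, adding a space after each token unless it is marked."""
    parts = []
    for token in tokens:
        parts.append(token.text)
        if not token.skip_space_after:
            parts.append(" ")
    # drop the space added after the final token
    return "".join(parts).rstrip()


# the last eight tokens of Mark 1:1 from the table above
tokens = [
    Token("Jesus"),
    Token("Christ", skip_space_after=True),
    Token(","),
    Token("the"),
    Token("Son"),
    Token("of"),
    Token("God", skip_space_after=True),
    Token("."),
]
print(detokenize(tokens))  # Jesus Christ, the Son of God.
```

Marking the token before the punctuation, rather than the punctuation itself, keeps the rule local: each token only needs to know whether a space follows it.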


In [1]:
# setup
from bible_alignments.burrito import DATAPATH, TargetReader

# read the BSB data for the New Testament
ntbsb = TargetReader(DATAPATH / "eng/targets/BSB/nt_BSB.tsv")
# ntbsb is a dictionary mapping token identifiers to Target instances
ntbsb["41001001010"]

<Target: 41001001010>

In [2]:
mrk1_1_10 = ntbsb["41001001010"]
print("Basic attributes for Mark 1:1.10 (BSB):")
print(f"identifier:\t{mrk1_1_10.id}")
# book/chapter/verse portion of identifier
print(f"bcv:\t\t{mrk1_1_10.bcv}")
print(f"text:\t\t{mrk1_1_10.text}")
# tuple of id and text
print(f"idtext:\t\t{mrk1_1_10.idtext}")
print(f"source_verse:\t{mrk1_1_10.source_verse}")
print(f"skip_space_after: {mrk1_1_10.skip_space_after}")
print(f"exclude:\t{mrk1_1_10.exclude}")
print()
print("Properties and methods:")
print(f"same_source_verse: {mrk1_1_10.same_source_verse}")
print(f"_display:\t{mrk1_1_10._display}")
print(f"asdict():\t{mrk1_1_10.asdict()}")


Basic attributes for Mark 1:1.10 (BSB):
identifier:	41001001010
bcv:		41001001
text:		Christ
idtext:		('41001001010', 'Christ')
source_verse:	41001001
skip_space_after: True
exclude:	False

Properties and methods:
same_source_verse: True
_display:	41001001010: Christ		 ('', False, False)
asdict():	{'id': '41001001010', 'text': 'Christ', 'source_verse': '41001001', 'skip_space_after': 'y', 'exclude': ''}
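
The output above suggests that an identifier can be read as fixed-width fields, with `bcv` as its first eight characters. The sketch below slices an identifier that way in plain Python. The field widths (2-digit book, 3-digit chapter, 3-digit verse, 3-digit word) are an assumption inferred from the sample identifiers, and `split_token_id` is a hypothetical helper; for real work, the `biblelib` utilities mentioned earlier are the better choice.

```python
def split_token_id(token_id: str) -> dict[str, str]:
    """Slice an 11-digit token identifier into fixed-width fields.

    Assumed layout (inferred from the sample data): BBCCCVVVWWW.
    """
    if len(token_id) != 11 or not token_id.isdigit():
        raise ValueError(f"expected 11 digits, got {token_id!r}")
    return {
        "book": token_id[0:2],
        "chapter": token_id[2:5],
        "verse": token_id[5:8],
        "word": token_id[8:11],
        # book + chapter + verse, comparable to a source_verse value
        "bcv": token_id[0:8],
    }


fields = split_token_id("41001001010")
print(fields["bcv"], fields["word"])

# a token whose bcv equals its source_verse has not moved across a verse boundary
print(fields["bcv"] == "41001001")
```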


## Corpus Properties

You can collect the tokens matching a given term, either preserving or ignoring case, with the `term_tokens()` method of the `TargetReader` class. 

In [3]:
# case-sensitive
ntbsb.term_tokens("Repent")

[<Target: 40003002005>,
 <Target: 40004017011>,
 <Target: 41001015020>,
 <Target: 44002038005>,
 <Target: 44003019001>,
 <Target: 44008022001>,
 <Target: 66002005012>]

In [4]:
# case-insensitive
ntbsb.term_tokens("Repent", lowercase=True)

[<Target: 40003002005>,
 <Target: 40004017011>,
 <Target: 40011020022>,
 <Target: 40021032037>,
 <Target: 41001015020>,
 <Target: 41006012011>,
 <Target: 42013003010>,
 <Target: 42013005010>,
 <Target: 42015007031>,
 <Target: 42016030025>,
 <Target: 42017004022>,
 <Target: 44002038005>,
 <Target: 44003019001>,
 <Target: 44008022001>,
 <Target: 44017030017>,
 <Target: 44026020029>,
 <Target: 66002005012>,
 <Target: 66002005027>,
 <Target: 66002016002>,
 <Target: 66002021009>,
 <Target: 66002022026>,
 <Target: 66003003015>,
 <Target: 66003019013>,
 <Target: 66009020016>,
 <Target: 66009021006>,
 <Target: 66016009029>,
 <Target: 66016011017>]
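
The difference between the two calls can be sketched with a toy dictionary. This is not the implementation of `term_tokens()`: `match_term` is a hypothetical function, the identifiers `t1` to `t4` are made up, and the sketch assumes the term is compared against the whole token text rather than a substring.

```python
def match_term(tokens: dict[str, str], term: str, lowercase: bool = False) -> list[str]:
    """Return identifiers of tokens whose text equals the term.

    With lowercase=True, both the term and each token text are lowercased
    before comparison, so the match ignores case.
    """
    if lowercase:
        term = term.lower()
        return [tid for tid, text in tokens.items() if text.lower() == term]
    return [tid for tid, text in tokens.items() if text == term]


# toy data: identifiers here are placeholders, not real corpus identifiers
toy = {"t1": "Repent", "t2": "repent", "t3": "repented", "t4": ","}

print(match_term(toy, "Repent"))                  # ['t1']
print(match_term(toy, "Repent", lowercase=True))  # ['t1', 't2']
```

Under these assumptions, the case-insensitive result is always a superset of the case-sensitive one, which is consistent with the two outputs above.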