## Tokenization, or "Tokenize Your Own"

Why tokenize yourself?

By default CollateX "splits on whitespace and interpunction". This often means that words become your unit of collation.

But what about "can't", "A'dam", "Peter's"?

Tokenization defines your *unit of comparison* and thus gives you more control over the comparison

Unfortunately this means learning how to offer a *pretokenized* *data structure* to CollateX
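To make the difference concrete, here is a minimal sketch in plain Python (no CollateX needed). It contrasts a rough approximation of "split on whitespace and punctuation" with a tokenizer that keeps word-internal apostrophes. Both function names and both regular expressions are illustrative assumptions, not CollateX's actual implementation.

```python
import re

def naive_tokenize(text):
    # Rough approximation of "split on whitespace and punctuation":
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def apostrophe_aware_tokenize(text):
    # Keep word-internal apostrophes, so "Peter's" stays one token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

for text in ["can't", "A'dam", "Peter's cat"]:
    print(naive_tokenize(text), apostrophe_aware_tokenize(text))
```

Which of the two is "right" depends on what you want to compare; the point is only that the choice becomes yours.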


## Pretokenized text as JSON

Assume:

    witness_A = "Peter's cat"
    witness_B = "Peter's dog"

Problem: "Peter's cat" will be tokenized as "Peter", "'", "s", "cat"

We will use JSON to tell CollateX to 'read' and collate it differently

JSON data is a mixture of arrays and objects:

Arrays look like this: 

    [ "i", "am", "the", "words", "in", "an", "array" ]

Objects look like this:

    { "a_variable_name": "My first value", "another_variable": "Another thingy" }
    
In JSON you may combine these:

    { "a_witness_object": { "siglum": "A", "tokens": [] } } 
    
Or the same, laid out so that it is somewhat easier to read:

    { "a_witness_object":
        { 
            "siglum": "A",
            "tokens": []
        }
    }
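In Python these structures map directly onto lists and dictionaries, and the standard `json` module converts between the two. A short sketch, reusing the example object above (the variable names are only for illustration):

```python
import json

witness_object = {"a_witness_object": {"siglum": "A", "tokens": []}}

# Serialize to compact JSON text ...
compact = json.dumps(witness_object)
# ... or laid out with indentation for readability.
pretty = json.dumps(witness_object, indent=4)
print(pretty)

# Both texts parse back to the same Python structure.
assert json.loads(compact) == json.loads(pretty) == witness_object
```

So whether the JSON is written on one line or spread over several makes no difference to the data it carries.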
    


## Let's try the good old-fashioned way…

In [2]:
from collatex import *

In [20]:
collation = Collation()
collation.add_plain_witness( "A", "Peter's cat")
collation.add_plain_witness( "B", "Peter's dog" )
alignment_table = collate( collation, layout='vertical', segmentation=False )

In [21]:
print( alignment_table )

+-------+-------+
|   A   |   B   |
+-------+-------+
| Peter | Peter |
+-------+-------+
|   '   |   '   |
+-------+-------+
|   s   |   s   |
+-------+-------+
|  cat  |  dog  |
+-------+-------+


## Hmm… indeed not quite what we want, thus…

In [3]:
tokens_a = [ { "t": "Peter's" }, { "t": "cat" } ]
tokens_b = [ { "t": "Peter's" }, { "t": "dog" } ]

In [5]:
witness_a = { "id": "A", "tokens": tokens_a }

In [6]:
print( witness_a )

{'id': 'A', 'tokens': [{'t': "Peter's"}, {'t': 'cat'}]}


In [7]:
witness_b = { "id": "B", "tokens": tokens_b }

In [8]:
JSON_input = { "witnesses": [ witness_a, witness_b ] }

In [15]:
result = collate_pretokenized_json( JSON_input, output='table' )

In [16]:
print( result )

+---+---------+-----+
| A | Peter's | cat |
| B | Peter's | dog |
+---+---------+-----+
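Writing the token lists by hand does not scale beyond a toy example. The sketch below builds the same kind of input structure from plain strings. The `tokenize` and `make_witness` helpers and the regular expression are hypothetical, made up for this illustration; only the `"witnesses"`, `"id"`, `"tokens"` and `"t"` keys are taken from the cells above.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: keeps word-internal apostrophes together.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def make_witness(siglum, text):
    # One {"t": ...} object per token, as in the cells above.
    return {"id": siglum, "tokens": [{"t": token} for token in tokenize(text)]}

plain_witnesses = {"A": "Peter's cat", "B": "Peter's dog"}
json_input = {
    "witnesses": [make_witness(siglum, text) for siglum, text in plain_witnesses.items()]
}
print(json_input)
```

The resulting `json_input` has the same shape as `JSON_input` above, so it could be handed to `collate_pretokenized_json` in the same way.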
