In this notebook we analyze in detail two bible-book combinations that show the oddest transitions in notebook 34:

- (ita-x-bible-vita1997.txt, Revelation) has D_structure=0 for all word-splitting datapoints and word-pasting datapoints from 0 to 300 merges
- (etu-x-bible.txt, John) has a gap between the 0-points

Finally, we look at all bibles that have relatively large gaps at the 0-points.

In [None]:
import os
from word_splitting import create_word_split_sets, get_output_file_dir, get_entropies, join_verses, mask_word_structure
import matplotlib.pyplot as plt
import random
import data
from compression_entropy import read_selected_verses, get_char_distribution, select_samples
from util import make_book_plot, BOOK_ID_NAME
from analysis import get_spearman
from compression_entropy import get_entropies as get_pasting_entropies

In [None]:
BIBLES_PATH = '../paralleltext/bibles/corpus/'
ITALIAN_BIBLE = 'ita-x-bible-vita1997.txt'
ITALIAN_BOOK = 'Revelation'
ITALIAN_BOOK_ID = 66
LOWERCASE = True                                        # from word_splitting.py
TRUNCATE_BOOKS = False                                  # from word_splitting.py
REMOVE_MISMATCHER_FILES = True                          # from word_splitting.py
N_MERGES = 30000                                        # from my standard run configuration on HPC
OUTPUT_PATH = 'output/KoplenigEtAl/WordSplitting/temp'  # local test path
MISMATCHER_PATH = '../KoplenigEtAl/shortestmismatcher.jar'

ETU_BIBLE = 'etu-x-bible.txt'
ETU_BOOK = 'John'

# ita-x-bible-vita1997.txt, Revelation

A quick look at the raw text does not reveal any strange features, so I will run the program in debug mode and look at some of the calculations.

Looking at the output numbers, D_order and D_structure are never exactly 0, although they come very close. What is strange is that, in the splitting experiment, D_structure ~ 0 at every datapoint. This would mean that NONE of the information is contained in the word structure, i.e., 'masking' the words makes no difference.

I verified that the entropies computed for the original and masked bibles were indeed very close (0.02% apart). My first hypothesis was that the characters in this bible were very rare, so I repeated the calculation using the characters a-z, A-Z, and 0-7, which together have the same number of characters as this book's character set. The calculation returned very similar numbers.

More tests to be run:

- try uniform weighting of characters (1.42% off), which would put D_structure=0.016
- check if D_structure=0.016 is atypical or typical for n_splits=0 -> xuo has 0.033, so maybe yes
- compare the length of the character set (60) to that of other bible-book combinations -> xuo has 58, which is comparable
- compare the character distribution to that of other bible-book combinations -> the distribution for the Italian bible is rather more centered at 0, though not enough to look particularly odd
- compare the length of the bible-book combination overall with other bible-book combinations -> the Italian book has 69686 characters. The xuo book has 88415. Again, not particularly different
- see what typical values are for other bibles with the same language and book
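As a reference for the quantities discussed above, here is a minimal sketch of how the two gaps can be read off an entropies dictionary. It assumes the dictionary has the keys 'orig', 'shuffled' and 'masked', as in the rest of this notebook. D_structure as masked minus orig matches the differences printed below; D_order as shuffled minus orig is my assumption. The example values are made up.

```python
def entropy_gaps(entropies):
    """Return the two gaps from a dict with 'orig', 'shuffled' and 'masked' entropies."""
    orig = entropies['orig']
    return {
        # Assumed definition: information lost when word order is shuffled
        'D_order': entropies['shuffled'] - orig,
        # Matches the masked - orig difference printed elsewhere in this notebook
        'D_structure': entropies['masked'] - orig,
    }


def relative_gap(orig, masked):
    """Relative difference between masked and original entropy (0.0002 means 0.02%)."""
    return (masked - orig) / orig


# Made-up numbers, only to illustrate a masked entropy that is 0.02% above the original
example = {'orig': 2.0, 'shuffled': 2.5, 'masked': 2.0004}
gaps = entropy_gaps(example)
print(gaps, relative_gap(example['orig'], example['masked']))
```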

In [None]:
import pandas as pd

df = pd.read_csv('all_entropies.csv')

In [None]:
for bible, grp in df[(df['bible'].apply(lambda x: x.startswith('ita'))) & (df['bible'] != 'ita-x-bible-vita1997.txt') & (df['iter_id'] == 0) & (df['book'] == 'Revelation') & (df['experiment'] == 'splitting')].groupby('bible'):
    assert len(grp) == 1
    orig_masked = grp[['orig', 'masked']].values[0]
    print(bible, f'{orig_masked[1] - orig_masked[0]:.3f}')

Meanwhile, for my bible of interest:

In [None]:
grp = df[(df['bible'] == 'ita-x-bible-vita1997.txt') & (df['iter_id'] == 0) & (df['book'] == 'Revelation') & (df['experiment'] == 'splitting')]
assert len(grp) == 1
orig_masked = grp[['orig', 'masked']].values[0]
print(f'{orig_masked[1] - orig_masked[0]:.3f}')

This value is much lower than the previous ones. So let's look at the inner details.

### Compare the character weights

In [None]:
filename = os.path.join(BIBLES_PATH, ITALIAN_BIBLE)
selected_book_verses, char_counter = read_selected_verses(filename,
                                                          LOWERCASE,
                                                          [ITALIAN_BOOK_ID],
                                                          TRUNCATE_BOOKS)

In [None]:
bib_char_ctr = {}
bib_filename = {}
bib_sel_book_verses = {}
other_italian_bibles = df[(df['bible'].apply(lambda x: x.startswith('ita'))) & (df['bible'] != ITALIAN_BIBLE)]['bible'].unique()
for other_ita_bib in other_italian_bibles:
    other_filename = os.path.join(BIBLES_PATH, other_ita_bib)
    other_sel_book_verses, other_char_ctr = read_selected_verses(other_filename, 
                                                                 LOWERCASE, 
                                                                 [ITALIAN_BOOK_ID], 
                                                                 TRUNCATE_BOOKS)
    bib_filename[other_ita_bib] = other_filename
    bib_sel_book_verses[other_ita_bib] = other_sel_book_verses
    bib_char_ctr[other_ita_bib] = other_char_ctr

In [None]:
most_common_chars = [el[0] for el in char_counter.most_common(10)]
for char_ctr in bib_char_ctr.values():
    most_common_chars += [el[0] for el in char_ctr.most_common(10)]
most_common_chars = list(set(most_common_chars))
print(most_common_chars)

In [None]:
plt.bar(most_common_chars, [char_counter[ch] for ch in most_common_chars])

In [None]:
for bib, char_ctr in bib_char_ctr.items():
    print(bib)
    plt.bar(most_common_chars, [char_ctr[ch] for ch in most_common_chars])
    plt.show()

The character distributions are roughly equivalent, so no surprises there.
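The visual comparison could be backed by a single number. The sketch below computes the total variation distance between two character counters (0 for identical distributions, 1 for disjoint ones). The helper name `total_variation` is hypothetical and not part of this project's modules; it assumes both counters are non-empty `collections.Counter` objects like the ones returned by `read_selected_verses`.

```python
from collections import Counter


def total_variation(counter_a, counter_b):
    """Total variation distance between two normalized character distributions."""
    total_a = sum(counter_a.values())
    total_b = sum(counter_b.values())
    chars = set(counter_a) | set(counter_b)
    # Counter returns 0 for characters that are missing from one of the bibles
    return 0.5 * sum(abs(counter_a[ch] / total_a - counter_b[ch] / total_b)
                     for ch in chars)


# Toy strings, not real bible text
print(total_variation(Counter('aab'), Counter('abb')))
```

With the notebook's objects, this could be called as `total_variation(char_counter, bib_char_ctr[bib])` for each of the other Italian bibles.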

### Compare the values of D_structure

In [None]:
df[(df['bible'].apply(lambda x: x.startswith('ita'))) & (df['experiment'] == 'splitting') & (df['iter_id'] == 0) & (df['book'] == 'Revelation')][['bible', 'D_structure']]

Clearly a much lower value for our bible of interest, so the question remains.

### Compare the length of the character set

In [None]:
print(ITALIAN_BIBLE, len(char_counter.keys()))
for bib, char_ctr in bib_char_ctr.items():
    print(bib, len(char_ctr.keys()))

Nothing strange here.

### Compare the length of the bible-book combinations overall

In [None]:
bib_sel_book_verses[ITALIAN_BIBLE] = selected_book_verses
for bib, sel_book_verses in bib_sel_book_verses.items():
    book_id_versions = create_word_split_sets(sel_book_verses, N_MERGES, OUTPUT_PATH, bib)
    n_pairs_verses = book_id_versions[ITALIAN_BOOK_ID]
    sample_verses = n_pairs_verses[0]
    verse_tokens = random.sample(sample_verses, k=len(sample_verses))
    joined_orig = join_verses(verse_tokens, insert_spaces=True)
    print(bib, type(joined_orig), len(joined_orig))

The overall lengths are unremarkable too.

## Further verifications

- Reproduce the result
- Check that the words are replaced
- Inspect the bibles by eye
- Spaces
- Run the masking with different seeds and see whether the value changes
- Verse indices
- Text repetition
- Most recent commits for this bible
- Run the analysis on only half of the book

### Reproduce the result

In [None]:
df[(df['bible'].apply(lambda x: x.startswith('ita'))) & (df['book'] == 'Revelation') & (df['iter_id'] == 0) & (df['experiment'] == 'splitting')][['bible', 'D_structure']]

In [None]:
book_id_versions = create_word_split_sets(bib_sel_book_verses[ITALIAN_BIBLE], 
                                          N_MERGES, 
                                          OUTPUT_PATH, 
                                          ITALIAN_BIBLE)
n_pairs_verses = book_id_versions[ITALIAN_BOOK_ID]
verse_tokens = n_pairs_verses[0]
filename = os.path.join(BIBLES_PATH, ITALIAN_BIBLE)
base_dir = get_output_file_dir(OUTPUT_PATH, filename)
base_filename = os.path.join(base_dir, f'{os.path.basename(filename)}_{ITALIAN_BOOK_ID}_v{0}')
entropies = get_entropies(verse_tokens,
                          base_filename,
                          REMOVE_MISMATCHER_FILES,
                          char_counter,
                          MISMATCHER_PATH)

In [None]:
print(entropies)
print(entropies['masked'] - entropies['orig'])

This is the same as before. What about another Italian bible?

In [None]:
book_id_versions = create_word_split_sets(bib_sel_book_verses['ita-x-bible-riveduta.txt'], 
                                          N_MERGES, 
                                          OUTPUT_PATH, 
                                          'ita-x-bible-riveduta.txt')
n_pairs_verses = book_id_versions[ITALIAN_BOOK_ID]
verse_tokens = n_pairs_verses[0]
filename = os.path.join(BIBLES_PATH, 'ita-x-bible-riveduta.txt')
base_dir = get_output_file_dir(OUTPUT_PATH, filename)
base_filename = os.path.join(base_dir, f'{os.path.basename(filename)}_{ITALIAN_BOOK_ID}_v{0}')
entropies = get_entropies(verse_tokens,
                          base_filename,
                          REMOVE_MISMATCHER_FILES,
                          bib_char_ctr['ita-x-bible-riveduta.txt'],
                          MISMATCHER_PATH)
print(entropies)
print(entropies['masked'] - entropies['orig'])

Also similar to the value saved in the file.

### Check that the words are replaced

In [None]:
book_id_versions = create_word_split_sets(bib_sel_book_verses[ITALIAN_BIBLE], 
                                          N_MERGES, OUTPUT_PATH, ITALIAN_BIBLE)
n_pairs_verses = book_id_versions[ITALIAN_BOOK_ID]
sample_verses = n_pairs_verses[0]
# Randomize the order of the verses in each sample
verse_tokens = random.sample(sample_verses, k=len(sample_verses))
# Mask word structure
char_str = ''.join(char_counter.keys())
char_weights = [char_counter[el] for el in char_str]
masked = mask_word_structure(verse_tokens, char_str, char_weights)

In [None]:
sample_index = 3

In [None]:
print(' '.join([el.token for el in masked[sample_index][:30]]))

In [None]:
print(' '.join([el.token for el in verse_tokens[sample_index][:30]]))

The replacement seems to be working correctly.
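The eyeball check can be complemented by counting how many tokens actually changed. This is a minimal sketch that works on plain lists of token strings and assumes that masking keeps the number and position of tokens in each verse. The helper name `fraction_changed` is hypothetical.

```python
def fraction_changed(original, masked):
    """Fraction of token positions whose string differs after masking."""
    changed = 0
    total = 0
    for orig_verse, masked_verse in zip(original, masked):
        for orig_tok, masked_tok in zip(orig_verse, masked_verse):
            total += 1
            changed += orig_tok != masked_tok
    return changed / total if total else 0.0


# Toy data, not real output of mask_word_structure
toy_original = [['in', 'principio'], ['era', 'il', 'verbo']]
toy_masked = [['xq', 'principio'], ['oza', 'tk', 'mnbvc']]
print(fraction_changed(toy_original, toy_masked))
```

With the objects above, the inputs would be `[[el.token for el in verse] for verse in verse_tokens]` and the same expression over `masked`.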

### Inspect the bibles by eye

I will compare it with one of the other Italian bibles.

In [None]:
with open(os.path.join(BIBLES_PATH, ITALIAN_BIBLE)) as f:
    bad_lines = f.readlines()
with open(os.path.join(BIBLES_PATH, other_italian_bibles[0])) as f:
    good_lines = f.readlines()
print('comparing', ITALIAN_BIBLE, 'and', other_italian_bibles[0])
bad_rev = [el for el in bad_lines if el.strip().startswith('66')]
good_rev = [el for el in good_lines if el.strip().startswith('66')]

In [None]:
print(len(bad_rev), len(good_rev))

In [None]:
# randrange excludes the upper bound, so the index is valid for both lists
sample_index = random.randrange(min(len(bad_rev), len(good_rev)))
print(bad_rev[sample_index])
print(good_rev[sample_index])

I see no significant differences.

### Spaces

In [None]:
assert bad_rev[sample_index][:9] == good_rev[sample_index][:9]

In [None]:
assert bad_rev[sample_index][10] == ' ' and good_rev[sample_index][10] == ' '

In [None]:
assert bad_rev[sample_index][-1] == good_rev[sample_index][-1]

### Run the masking with different seeds and see whether the value changes

In [None]:
random.seed(10)
book_id_versions = create_word_split_sets(bib_sel_book_verses[ITALIAN_BIBLE], 
                                          N_MERGES, 
                                          OUTPUT_PATH, 
                                          ITALIAN_BIBLE)
n_pairs_verses = book_id_versions[ITALIAN_BOOK_ID]
verse_tokens = n_pairs_verses[0]
filename = os.path.join(BIBLES_PATH, ITALIAN_BIBLE)
base_dir = get_output_file_dir(OUTPUT_PATH, filename)
base_filename = os.path.join(base_dir, f'{os.path.basename(filename)}_{ITALIAN_BOOK_ID}_v{0}')
entropies = get_entropies(verse_tokens,
                          base_filename,
                          REMOVE_MISMATCHER_FILES,
                          char_counter,
                          MISMATCHER_PATH)
print('before:', entropies)
random.seed(30)
entropies = get_entropies(verse_tokens,
                          base_filename,
                          REMOVE_MISMATCHER_FILES,
                          char_counter,
                          MISMATCHER_PATH)
print('after:', entropies)

It's not the random seed.

### Verse indices

In [None]:
bad_indices = [el[:8] for el in bad_rev]
good_indices = [el[:8] for el in good_rev]

In [None]:
assert bad_indices == good_indices

The indices are the same.

### Text repetition

- several indices are empty, while the preceding verses are longer

In [None]:
[el for el in bad_rev if el[8:].strip() == '']

In [None]:
[el for el in good_rev if el[8:].strip() == '']

Furthermore, verse 66007004 has repeated text.

In [None]:
empty_verses = [int(el) for el in bad_rev if el[8:].strip() == '']

In [None]:
previous_verses = [el - 1 for el in empty_verses]

In [None]:
excluded_verses = sorted(set(empty_verses + previous_verses))
print('exclude', excluded_verses)

In [None]:
bible_entropies = {}
for bib in df[df['bible'].apply(lambda x: x.startswith('ita'))]['bible'].unique():
    filename = os.path.join(BIBLES_PATH, bib)
    chosen_books = [ITALIAN_BOOK_ID]
    # Read the complete bible
    bible = data.parse_pbc_bible(filename)
    # Tokenize by splitting on spaces
    tokenized = bible.tokenize(remove_punctuation=False, lowercase=LOWERCASE)
    book_verses = {k: v for k, v in tokenized.verse_tokens.items() if k.startswith(str(ITALIAN_BOOK_ID))}
    selected_verses = {k: v for k, v in book_verses.items() if int(k) not in excluded_verses}
    _, _, book_verses, _, _ = data.join_by_toc(selected_verses)
    selected_book_verses = select_samples(book_verses, [ITALIAN_BOOK_ID], TRUNCATE_BOOKS)
    char_counter = get_char_distribution(''.join([el for lis in selected_verses.values() 
                                                  for el in lis]))
    book_id_versions = create_word_split_sets(selected_book_verses, N_MERGES, OUTPUT_PATH, 
                                              filename.split('/')[-1])
    n_pairs_verses = book_id_versions[ITALIAN_BOOK_ID]
    verse_tokens = n_pairs_verses[0]
    base_dir = get_output_file_dir(OUTPUT_PATH, filename)
    base_filename = os.path.join(base_dir, f'{os.path.basename(filename)}_{ITALIAN_BOOK_ID}_v{0}')
    entropies = get_entropies(verse_tokens,
                              base_filename,
                              REMOVE_MISMATCHER_FILES,
                              char_counter,
                              MISMATCHER_PATH)
    bible_entropies[bib] = entropies

In [None]:
bible_entropies['ita-x-bible-riveduta.txt']['masked'] - bible_entropies['ita-x-bible-riveduta.txt']['orig']

In [None]:
bible_entropies[ITALIAN_BIBLE]['masked'] - bible_entropies[ITALIAN_BIBLE]['orig']

In [None]:
print(previous_verses)

I created a version of the file that excludes the empty verses and their preceding verses, and now the results are consistent with the other Italian-language bibles.
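The exclusion rule described above can be sketched on raw lines as follows. It assumes each line starts with an 8-digit verse id followed by the verse text, as in the checks earlier in this section. The helper name `drop_empty_and_previous` and the toy lines are made up for illustration.

```python
def drop_empty_and_previous(lines):
    """Drop verses with empty text and the verse immediately before each of them."""
    empty_ids = {int(line[:8]) for line in lines if line[8:].strip() == ''}
    excluded = empty_ids | {verse_id - 1 for verse_id in empty_ids}
    kept = [line for line in lines if int(line[:8]) not in excluded]
    return kept, sorted(excluded)


# Toy lines in the same "id<TAB>text" layout; the text is invented
toy_lines = ['66007003\tshort verse',
             '66007004\ta very long verse that absorbed the next one',
             '66007005\t',
             '66007006\tanother verse']
kept, excluded = drop_empty_and_previous(toy_lines)
print(excluded)
```

Subtracting 1 from the id only finds the preceding verse within the same chapter, which is the same simplification used in the cells above.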

### Most recent commits for this bible

There are none. I have opened an issue.

## More questions

- Do general results hold without this bible?
- What are the correlation values for this bible?
- What does it cost me to remove this bible?
- Do other bibles also contain empty lines?

### Other bibles with empty lines

In [None]:
bibles_with_empty_verses = []
selected_books = [40, 41, 42, 43, 44, 66]
bibles = os.listdir(BIBLES_PATH)
for bible in bibles:
    with open(os.path.join(BIBLES_PATH, bible)) as f:
        lines = f.readlines()
    verses = [line.strip() for line in lines if any([line.strip().startswith(str(book)) 
                                                     for book in selected_books])]
    empty_verses = [verse for verse in verses if verse[8:].strip() == '']
    if len(empty_verses) > 0:
        bibles_with_empty_verses.append((bible, empty_verses))

In [None]:
print(len(bibles_with_empty_verses))

1441 bibles have empty verses, so empty verses on their own cannot be the problem.

### Long lines

How long was the problematic verse in the bible we just looked at?

In [None]:
with open(os.path.join(BIBLES_PATH, ITALIAN_BIBLE)) as f:
    lines = f.readlines()
verses = [line.strip() for line in lines if line.startswith(str(ITALIAN_BOOK_ID))]
problem_verse = [verse for verse in verses if verse.startswith('66007004')]
assert len(problem_verse) == 1
problem_verse = problem_verse[0]
print(len(problem_verse[9:]))

So let's see how many bibles have at least one verse of at least half this length (the threshold used below is 1500 characters).

In [None]:
bibles_with_long_verses = []
selected_books = [40, 41, 42, 43, 44, 66]
bibles = os.listdir(BIBLES_PATH)
for bible in bibles:
    with open(os.path.join(BIBLES_PATH, bible)) as f:
        lines = f.readlines()
    verses = [line.strip() for line in lines if any([line.strip().startswith(str(book)) 
                                                     for book in selected_books])]
    long_verses = [verse for verse in verses if len(verse[8:].strip()) > 1500]
    if len(long_verses) > 0:
        bibles_with_long_verses.append((bible, long_verses))

In [None]:
print(len(bibles_with_long_verses))

70 is around 3% of all the bibles. I can believe that this could be a problem. Let's take a random one and look at the word-order vs word-structure plot.

In [None]:
bib_w_long_verse = random.choice(bibles_with_long_verses)
print(bib_w_long_verse)

In [None]:
make_book_plot(df[df['bible'] == bib_w_long_verse[0]], 
               BOOK_ID_NAME[bib_w_long_verse[1][0][:2]], 
               bib_w_long_verse[0])

This did not give 0 values in the way we saw for the Italian bible. So, the final suspicion is that the cause is repeated text. What is the longest text sequence that repeats at least once in the Italian bible?

In [None]:
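def get_len_longest_repeating_seq(text: str) -> tuple:
    # Returns (length, start index) of the longest substring that occurs again,
    # without overlap, later in the text; (0, -1) if nothing repeats.
    candidate = (0, -1)
    # A non-overlapping repeat can be at most half the text, inclusive
    for n in range(1, len(text) // 2 + 1):
        for ix in range(len(text)):
            if text[ix:ix+n] in text[ix+n:]:
                candidate = (n, ix)
                break
        if candidate[0] < n:
            break
    return candidate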

In [None]:
text = 'aaaaabaaaaaaBible ita-x-bible-vita1997.txt contains several empty verses, because their content seems merged with a preceding verse. More importantly, verse 66007004 is especially suspicious, as it contains a lot of repeated text. This is not present in other Italian-language bibles, and I believe it is a transcription error. The metadata for this translation points to https://www.bible.com/bible/92/mat.1.bdg, which no longer works. None of the versions found in https://www.bible.com/bible/92 corresponds to either La Parola è Vita or 1997. The publisher is listed as "Biblica, Inc.", but https://www.biblica.com does not work.'
get_len_longest_repeating_seq(text)
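The search above tries every length in turn, which gets slow on long verses. Since a non-overlapping repeat of length n implies one of length n - 1, the length can also be found by binary search. The following is a sketch of that alternative, not the function used for the results in this notebook. The name `longest_nonoverlapping_repeat` is hypothetical, and the returned start index may belong to a different repeated substring of the same length.

```python
def longest_nonoverlapping_repeat(text):
    """Return (length, start) of the longest substring that repeats without overlap."""

    def find(n):
        # Remember where each substring of length n was first seen
        first_seen = {}
        for ix in range(len(text) - n + 1):
            chunk = text[ix:ix + n]
            start = first_seen.setdefault(chunk, ix)
            if ix - start >= n:  # second occurrence begins after the first one ends
                return start
        return -1

    best = (0, -1)
    low, high = 1, len(text) // 2
    while low <= high:
        mid = (low + high) // 2
        start = find(mid)
        if start >= 0:
            best = (mid, start)
            low = mid + 1   # a repeat of this length exists, try longer ones
        else:
            high = mid - 1  # no repeat of this length, try shorter ones
    return best


print(longest_nonoverlapping_repeat('abcXabc'))
```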

In [None]:
with open(os.path.join(BIBLES_PATH, ITALIAN_BIBLE)) as f:
    lines = [el[8:].strip() for el in f.readlines() if el.startswith('66007004')]
assert len(lines) == 1
line = lines[0]
print(get_len_longest_repeating_seq(line))

In [None]:
print(line[:1162])

In [None]:
print(line[1162:])

In [None]:
for bible, long_verses in bibles_with_long_verses:
    for long_verse in long_verses:
        verse = long_verse[8:].strip()
        if get_len_longest_repeating_seq(verse)[0] > 600:
            print(bible, long_verse[:8])

So, the only other bible with such a long verse containing repeated text is kss-x-bible.txt. Let's look at the plots for that one.

In [None]:
make_book_plot(df[df['bible'] == 'kss-x-bible.txt'], 
               BOOK_ID_NAME['40'], 
               'kss-x-bible.txt')

In [None]:
make_book_plot(df[df['bible'] == 'kss-x-bible.txt'], 
               BOOK_ID_NAME['41'], 
               'kss-x-bible.txt')

In [None]:
make_book_plot(df[df['bible'] == 'kss-x-bible.txt'], 
               BOOK_ID_NAME['42'], 
               'kss-x-bible.txt')

In [None]:
make_book_plot(df[df['bible'] == 'kss-x-bible.txt'], 
               BOOK_ID_NAME['44'], 
               'kss-x-bible.txt')

### Do general results hold without this bible?

It seems so, because the general trends are the same.

### What are the correlation values for this bible?

In [None]:
df['n_splits'] = df.apply(lambda row: row['iter_id'] if row['experiment'] == 'splitting' 
                          else (-1) * row['iter_id'], 1)
get_spearman(df[(df['bible'] == ITALIAN_BIBLE) & (df['book'] == ITALIAN_BOOK)])

In [None]:
# Another Italian bible
other_ita_bib = df[(df['bible'].apply(lambda x: x.startswith('ita') and x != ITALIAN_BIBLE))].bible.tolist()[0]
get_spearman(df[(df['bible'] == other_ita_bib) & (df['book'] == ITALIAN_BOOK)])

The conclusions do not change significantly. What about the other bible?

In [None]:
get_spearman(df[(df['bible'] == 'kss-x-bible.txt') & (df['book'] == BOOK_ID_NAME['41'])])

Again, consistent with what we observe for all bibles, so I'm not worried. The functional form might change but the Spearman correlation coefficient does not.
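The last remark can be illustrated with a small self-contained example: Spearman's coefficient depends only on ranks, so any strictly increasing change of functional form leaves it untouched, while Pearson's coefficient changes. This sketch uses its own minimal implementation (no tie handling) and made-up numbers; it is not the `get_spearman` function from `analysis`.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    # Spearman's rho is Pearson's r computed on the ranks
    return pearson(ranks(xs), ranks(ys))


x = [-3, -1, 0, 2, 5]             # made-up values
y_linear = [2 * v + 1 for v in x]
y_cubic = [v ** 3 for v in x]     # different functional form, same ordering
print(spearman(x, y_linear), spearman(x, y_cubic), pearson(x, y_cubic))
```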

### What does it cost me to remove this bible?

If I add these two bibles to the list of bibles to be excluded, then modify notebook 34 and run it again, I should be able to update the results very quickly.

I did this and obtained the same results, so I will not change the plots in the paper.

# etu-x-bible.txt, John

In [None]:
df[(df['bible'] == ETU_BIBLE) & (df['book'] == ETU_BOOK) & (df['iter_id'] == 0)]

In [None]:
book_name_id = {v: int(k) for k, v in BOOK_ID_NAME.items()}

This is extremely suspicious: why should the shuffled and masked values differ noticeably between the two experiments, while the 'orig' values are identical? Shuffling and masking are done in exactly the same way by the two get_entropies functions, and both call the same get_entropy function. My preferred explanation: this book is so short that random fluctuations matter. Let's run the function a few times with different seeds.

In [None]:
def run_experiment(seed):
    random.seed(seed)
    filename = os.path.join(BIBLES_PATH, ETU_BIBLE)
    book_id = book_name_id[ETU_BOOK]
    selected_book_verses, char_counter = read_selected_verses(filename,
                                                              LOWERCASE,
                                                              [book_id],
                                                              TRUNCATE_BOOKS)
    book_id_versions = create_word_split_sets(selected_book_verses, N_MERGES, OUTPUT_PATH, ETU_BIBLE)
    n_pairs_verses = book_id_versions[book_name_id[ETU_BOOK]]
    sample_verses = n_pairs_verses[0]
    base_dir = get_output_file_dir(OUTPUT_PATH, filename)
    base_filename = os.path.join(base_dir, f'{os.path.basename(filename)}_{book_id}_v{0}')
    entropies = get_entropies(sample_verses,
                              base_filename,
                              REMOVE_MISMATCHER_FILES,
                              char_counter,
                              MISMATCHER_PATH)
    return entropies

entropies_experiments = []
for seed in (10, 30, 100, 300, 1000):
    entropies_experiments.append(run_experiment(seed))

In [None]:
for k in ('orig', 'shuffled', 'masked'):
    print(k, [el[k] for el in entropies_experiments])

The results were completely consistent from one run to the next, so that cannot be the explanation. Let's try the get_entropies function from compression_entropy instead.

In [None]:
filename = os.path.join(BIBLES_PATH, ETU_BIBLE)
book_id = book_name_id[ETU_BOOK]
selected_book_verses, char_counter = read_selected_verses(filename,
                                                          LOWERCASE,
                                                          [book_id],
                                                          TRUNCATE_BOOKS)
book_id_versions = create_word_split_sets(selected_book_verses, N_MERGES, OUTPUT_PATH, ETU_BIBLE)
n_pairs_verses = book_id_versions[book_name_id[ETU_BOOK]]
sample_verses = n_pairs_verses[0]
base_dir = get_output_file_dir(OUTPUT_PATH, filename)
base_filename = os.path.join(base_dir, f'{os.path.basename(filename)}_{book_id}_v{0}')

In [None]:
sample_verse_tokens = [[token.token for token in verse] for verse in sample_verses]

In [None]:
print(get_pasting_entropies(sample_verse_tokens,
                            base_filename,
                            REMOVE_MISMATCHER_FILES,
                            char_counter,
                            MISMATCHER_PATH))

In [None]:
df[(df['bible'] == ETU_BIBLE) & (df['book'] == ETU_BOOK) & (df['experiment'] == 'pasting') & (df['iter_id'] == 0)]

In [None]:
df[(df['bible'] == ETU_BIBLE) & (df['book'] == ETU_BOOK) & (df['experiment'] == 'splitting') & (df['iter_id'] == 0)]

Interestingly, the values I got differ from both the splitting and the pasting values obtained before. Could it be that the random seed matters more for pasting? It seems odd, but let's try it.

In [None]:
def run_pasting_experiment(seed):
    random.seed(seed)
    filename = os.path.join(BIBLES_PATH, ETU_BIBLE)
    book_id = book_name_id[ETU_BOOK]
    selected_book_verses, char_counter = read_selected_verses(filename,
                                                              LOWERCASE,
                                                              [book_id],
                                                              TRUNCATE_BOOKS)
    book_id_versions = create_word_split_sets(selected_book_verses, N_MERGES, OUTPUT_PATH, ETU_BIBLE)
    n_pairs_verses = book_id_versions[book_name_id[ETU_BOOK]]
    sample_verses = n_pairs_verses[0]
    base_dir = get_output_file_dir(OUTPUT_PATH, filename)
    base_filename = os.path.join(base_dir, f'{os.path.basename(filename)}_{book_id}_v{0}')
    sample_verse_tokens = [[token.token for token in verse] for verse in sample_verses]
    entropies = get_pasting_entropies(sample_verse_tokens,
                                      base_filename,
                                      REMOVE_MISMATCHER_FILES,
                                      char_counter,
                                      MISMATCHER_PATH)
    return entropies

entropies_experiments = []
for seed in (10, 30, 100, 300, 1000):
    entropies_experiments.append(run_pasting_experiment(seed))

In [None]:
for k in ('orig', 'shuffled', 'masked'):
    print(k, [el[k] for el in entropies_experiments])

These are also self-consistent. What if this bible contains, at the start of a token, the same character that I reserved to mark tokens that start in the middle of a word?

In [None]:
verse_tokens = random.sample(sample_verses, k=len(sample_verses))

In [None]:
shuffled = [random.sample(words, k=len(words)) for words in verse_tokens]

In [None]:
char_str = ''.join(char_counter.keys())
char_weights = [char_counter[el] for el in char_str]

In [None]:
from compression_entropy import join_verses as join_verses_pasting

In [None]:
joined_splitting = join_verses(shuffled, insert_spaces=True)

In [None]:
joined_pasting = join_verses_pasting([[token.token for token in verse] for verse in shuffled], insert_spaces=True)

In [None]:
joined_splitting[:100]

In [None]:
joined_pasting[:100]

In [None]:
[el for el in shuffled[0] if not el.is_start_of_word]

I had to step through the code in debug mode, but eventually I found the problem: the bible contains certain non-standard whitespace characters, such as here:

In [None]:
'a bhá'.split(' ')

Not specifying the split character would have been better:

In [None]:
'a bhá'.split()
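To make the difference between the two calls visible regardless of how the whitespace is rendered, the character can be written as an escape. This sketch assumes the offending character is a no-break space (U+00A0); the notebook does not record which whitespace character the bible actually contains, and the example string is hypothetical.

```python
# Hypothetical example: a no-break space between 'a' and 'bhá', a regular space before 'c'
text = 'a\u00a0bh\u00e1 c'

by_space = text.split(' ')  # splits only on the regular space
by_any = text.split()       # splits on any Unicode whitespace

print(by_space)
print(by_any)
```

Splitting on `' '` keeps `'a\xa0bhá'` as one token, while `split()` without arguments separates it into two.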

Action points:

1. Add an issue on GitHub
2. Add a todo in my code
3. Verify if many bibles have this issue

In [None]:
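# NOTE: skip_bibles is not defined anywhere in this notebook; an empty set is assumed here
skip_bibles = set()
for bible in os.listdir(BIBLES_PATH):
    if bible in skip_bibles: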
        continue
    with open(os.path.join(BIBLES_PATH, bible)) as f:
        lines = f.readlines()
    reached_notes = False
    found_weird = False
    for line in lines:
        if '# notes' in line.lower():
            reached_notes = True
        if line.startswith('#'):
            continue
        if not reached_notes:
            continue
        try:
            if int(line[:2]) not in selected_books:
                continue
        except ValueError as e:
            print('ERROR:', bible, line)
            raise e
        content = line[8:].strip()
        if content.split() != content.split(' ') and content.strip() != '':
            print(bible, line[:8].strip())
            found_weird = True
            break

I'm not going to exclude these in the final analysis, but I tried excluding them as a check, and nothing changed.
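The check `content.split() != content.split(' ')` above also fires on two consecutive regular spaces, so it can be useful to list which whitespace characters a flagged verse contains. This is a minimal sketch using the standard `unicodedata` module; the helper name `odd_whitespace` is hypothetical and the sample strings are made up.

```python
import unicodedata


def odd_whitespace(content):
    """Names of whitespace characters in content other than the regular space."""
    return sorted({
        # Control characters such as tab have no Unicode name, so fall back to the code point
        unicodedata.name(ch, f'U+{ord(ch):04X}')
        for ch in content
        if ch.isspace() and ch != ' '
    })


print(odd_whitespace('a\u00a0b c'))  # verse with a no-break space
print(odd_whitespace('a  b'))        # only a doubled regular space
```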

# All bibles that have relatively large gaps at the 0-points

After all the checks above, all remaining differences are insignificant.