This should follow a procedure similar to 10_reproduce_koplenig_et_al_fig_1.ipynb. Now, however, instead of loading multiple languages as the input, we take a single bible and perform a word-pasting experiment, so that the same language ends up with multiple points in the structure vs order graph. The hypothesis is that the more words are pasted together, the more of the information is carried by word structure. In the Koplenig et al. graph, this means the point will move towards the top left.

Thus, let's pick a point at the bottom right to start with. "cmn" (Mandarin Chinese) is written in characters, and spaces are not well defined, so we skip it. The next language is "xuo" (Kuo), which is written in Latin letters, so we will study that one. As a second, more familiar case, I will also study English. So let's load a bible in each of these languages:

In [None]:
import data
import compression_entropy as ce
import random
from collections import defaultdict
import json

In [None]:
bibles_filenames = {'xuo': 'xuo-x-bible.txt', 'eng': 'eng-x-bible-world.txt'}

I follow the procedure in the main program of compression_entropy.py.

In [None]:
BIBLES_PATH = '/home/pablo/Documents/GitHubRepos/paralleltext/bibles/corpus/'

In [None]:
files_with_path = [BIBLES_PATH + file.strip() for file in bibles_filenames.values()]

In [None]:
lowercase=True
REMOVE_MISMATCHER_FILES=True
chosen_books=[40, 41, 42, 43, 44, 66]
truncate_books=True

In [None]:
# Trailing slash: this path is used as a prefix in plain string concatenation below
TEMP_FILES_PATH = 'output/KoplenigEtAl/WordPasting/'

In [None]:
file_tokens = {}
for filename in files_with_path:
    bible = data.parse_pbc_bible(filename)
    tokenized = bible.tokenize(remove_punctuation=False, lowercase=lowercase)
    char_set = ''.join(set(''.join([el for lis in tokenized.verse_tokens.values() for el in lis])))
    _, _, book_verses, _, _ = data.join_by_toc(tokenized.verse_tokens)
    selected_book_verses = ce.select_samples(book_verses, chosen_books, truncate_books)
    book_base_filename = {book_id: TEMP_FILES_PATH + filename.split('/')[-1] + f'_{book_id}' \
                          for book_id in selected_book_verses.keys()}
    book_tokens = {book_id: random.sample(verses, k=len(verses)) for book_id, verses in selected_book_verses.items()}
    file_tokens[filename] = book_tokens

file_tokens is a nested dictionary. The first key is the filename and the second key is the book number. The value is a list of verses (in random order), and each verse is a list of tokens.

The next step in the original pipeline would be to create two more versions of the text: one shuffles the words within each verse, while the other replaces the words with arbitrary new words. If we allowed the word-pasting experiment to work across verses, a pasted word spanning a verse boundary would not belong to any well-defined verse. For this reason, the word-pasting experiment will be applied at the verse level.
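
To make the two manipulations concrete, here is a minimal sketch on a toy list of verses. The helper names `shuffle_within_verses` and `mask_word_types` are hypothetical, and the masking rule used here (each distinct word is replaced by a random string of the same length over the same character set) is my assumption about what `ce.mask_word_structure` does, not a copy of it.

```python
import random

def shuffle_within_verses(verses, rng):
    # Destroys word order inside each verse; verse membership is preserved
    return [rng.sample(verse, k=len(verse)) for verse in verses]

def mask_word_types(verses, rng):
    # Assumption: each distinct word maps to one random string of equal length
    chars = sorted({ch for verse in verses for word in verse for ch in word})
    mapping = {}
    for verse in verses:
        for word in verse:
            if word not in mapping:
                mapping[word] = ''.join(rng.choice(chars) for _ in word)
    return [[mapping[word] for word in verse] for verse in verses]

rng = random.Random(0)
toy_verses = [['in', 'the', 'beginning'], ['the', 'word', 'was', 'with', 'god']]
shuffled = shuffle_within_verses(toy_verses, rng)
masked = mask_word_types(toy_verses, rng)
print(shuffled)
print(masked)
```

Both helpers return new lists and keep the verse boundaries intact, which is why pasting words only within a verse keeps all three versions comparable.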

First some data exploration. What are the most common bigrams in these texts?

In [None]:
file_bigrams = {}
for filename, book_tokens in file_tokens.items():
    book_bigrams = {}
    for book_id, verses in book_tokens.items():
        bigram_counter = defaultdict(int)
        for verse in verses:
            for i, word in enumerate(verse[:-1]):
                bigram_counter[word + ' ' + verse[i+1]] += 1
        book_bigrams[book_id] = bigram_counter
    file_bigrams[filename] = book_bigrams
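
As an aside, the same count can be sketched more compactly with `collections.Counter`. The function name `count_bigrams` and the toy verses are illustrative only. Unlike the cell above, this version keys each bigram by a tuple of tokens rather than by a space-joined string, which I assume would be more robust once tokens themselves contain spaces.

```python
from collections import Counter

def count_bigrams(verses):
    # Count adjacent token pairs within each verse; pairs never span verses
    counts = Counter()
    for verse in verses:
        counts.update(zip(verse, verse[1:]))
    return counts

toy_verses = [['in', 'the', 'beginning'], ['in', 'the', 'end'], ['the', 'end']]
toy_counts = count_bigrams(toy_verses)
print(toy_counts.most_common(2))
```

`Counter.most_common` then gives the most frequent bigrams directly, without a manual threshold.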

In [None]:
# Build the key from the variables defined above instead of hard-coding the path
eng_file = BIBLES_PATH + bibles_filenames['eng']
for book_id, bigram_counts in file_bigrams[eng_file].items():
    print(book_id, {bigram: count for bigram, count in bigram_counts.items() if count > 60})

In [None]:
file_bigrams[BIBLES_PATH + bibles_filenames['eng']][40]['jesus christ']

This is a bit surprising to me, but let's go through with the experiment anyway. I need a function that, given a list of verses, returns a new list of verses in which every occurrence of the most frequent bigram has been replaced by a single word. I will use a space as the separator inside the pasted word.

In [None]:
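def join_words(verse: list, locations: list) -> list:
    """Return a new verse in which, for each index i in locations, the tokens
    at i and i + 1 are pasted together with a space.

    locations must be in strictly decreasing order."""
    assert all(locations[i] > locations[i+1] for i in range(len(locations)-1))
    location_set = set(locations)
    assert len(location_set) == len(locations)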
    joined = []
    i = 0
    while i < len(verse):
        if i in location_set:
            joined.append(verse[i] + ' ' + verse[i + 1])
            i += 2
        else:
            joined.append(verse[i])
            i += 1
    return joined

def test_join_words():
    joined = join_words('I love the nightlife and I do not make a big fuss about it'.split(), [5, 2])
    assert ['I', 'love', 'the nightlife', 'and', 'I do', 'not', 'make', 'a', 'big', 'fuss', 'about', 'it'] == joined
    print('join_words works')
    
def test_join_words_copy():
    # Check that a new list is returned even when nothing is joined
    verse = 'I love the nightlife and I do not make a big fuss about it'.split()
    joined = join_words(verse, [])
    assert verse == joined
    assert joined is not verse
    print('join_words works with copy')
    
test_join_words()
test_join_words_copy()

In [None]:
def merge_positions(verses: list, positions: list) -> list:
    # Group the token indices to merge by verse index
    verse_locations = defaultdict(list)
    for verse_ix, token_ix in positions:
        verse_locations[verse_ix].append(token_ix)
    # Copy every verse so that the input is never mutated; otherwise the
    # versions built by repeated calls would all alias the same list
    merged = [list(verse) for verse in verses]
    for verse_ix, locations in verse_locations.items():
        # join_words expects the locations in strictly decreasing order
        merged[verse_ix] = join_words(merged[verse_ix], sorted(locations, reverse=True))
    return merged

def test_merge_positions():
    verses = [['I', 'love', 'the', 'nightlife', 'and', 'I', 'do', 'not', 'make', 'a', 'big', 'fuss', 'about', 'it'],
             ['Belgium', 'plays', 'ugly'],
             ['No', 'hubo', 'otro', 'como', 'Forlan']]
    original = [list(verse) for verse in verses]
    positions = [(0, 4), (0, 7), (2, 1)]
    merged = merge_positions(verses, positions)
    expected = [['I', 'love', 'the', 'nightlife', 'and I', 'do', 'not make', 'a', 'big', 'fuss', 'about', 'it'],
               ['Belgium', 'plays', 'ugly'],
               ['No', 'hubo otro', 'como', 'Forlan']]
    assert expected == merged, merged
    # Check that the input was not mutated and that untouched verses were copied
    assert original == verses
    assert merged[1] is not verses[1]
    print('merge_positions works')
    
test_merge_positions()

In [None]:
def replace_top_bigram(verses: list) -> list:
    bigram_positions = defaultdict(list)
    for j, verse in enumerate(verses):
        for i, word in enumerate(verse[:-1]):
            bigram_positions[word + ' ' + verse[i+1]].append((j, i))
    # Now the bigram with the longest list of positions is the most frequent bigram
    top_bigram = ''
    n_pos = 0
    for bigram, positions in bigram_positions.items():
        if len(positions) > n_pos:
            top_bigram = bigram
            n_pos = len(positions)
    #print(top_bigram, n_pos)
    return merge_positions(verses, bigram_positions[top_bigram])

def test_replace_top_bigram():
    verses = ['Congratulations, you have finished installing TWiki!'.split(),
             'Replace this text with a description of your new TWiki site and links to content.'.split(),
             'To learn more about TWiki, visit the new TWiki web.'.split()]
    replaced = replace_top_bigram(verses)
    assert verses[0] == replaced[0]
    expected = [verses[0], 
                ['Replace', 'this', 'text', 'with', 'a', 'description', 'of', 'your', 'new TWiki', 'site', 'and', 
                 'links', 'to', 'content.'],
               ['To', 'learn', 'more', 'about', 'TWiki,', 'visit', 'the', 'new TWiki', 'web.']]
    assert expected == replaced
    # Check that a copy was made even when no changes were made
    expected[0] = []
    assert 'Congratulations, you have finished installing TWiki!'.split() == replaced[0]
    print('replace_top_bigram works')
    
test_replace_top_bigram()

These functions work, so now we can perform the experiment on the two bibles loaded above. I will treat each book as fully independent, because that is what Koplenig et al. did for this part of their work.

In [None]:
file_versions = {}
for file, book_tokens in file_tokens.items():
    book_id_versions = {}
    for book_id, tokens in book_tokens.items():
        joined_verses = [tokens]
        for n_joins in range(100):
            joined_verses.append(replace_top_bigram(joined_verses[-1]))
        book_id_versions[book_id] = joined_verses
    file_versions[file] = book_id_versions

Now we have all these different versions. I'm not sure 100 merges are going to be enough to see an effect, but we can only know by plotting.
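
Before running the compressor, a cheap sanity check is to look at how the number of tokens and the number of distinct tokens change from one version to the next. The sketch below uses a hypothetical helper, `summarize_versions`, on toy data; with the real data one would presumably pass it `file_versions[file][book_id]`.

```python
def summarize_versions(versions):
    # One (n_merges, n_tokens, n_types) row per version of the text
    rows = []
    for n_merges, verses in enumerate(versions):
        tokens = [token for verse in verses for token in verse]
        rows.append((n_merges, len(tokens), len(set(tokens))))
    return rows

toy_versions = [
    [['in', 'the', 'end'], ['in', 'the', 'beginning']],
    [['in the', 'end'], ['in the', 'beginning']],
]
for row in summarize_versions(toy_versions):
    print(row)
```

The token count should drop with every merge. If every row comes out identical, the versions are probably references to one and the same list rather than independent copies.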

In [None]:
file_entropies = {}
for filename, book_tokens in file_versions.items():
    print(filename)
    book_id_entropies = {}
    for book_id, n_pairs_verses in book_tokens.items():
        print(book_id)
        n_pairs_entropies = {}
        for n_pairs, verse_tokens in enumerate(n_pairs_verses):
            print(n_pairs, end=' ')
            # verse_tokens is a list of verses, each a list of tokens; all computations work on this structure
            shuffled = [random.sample(words, k=len(words)) for words in verse_tokens]
            char_set = ''.join(set(''.join([el for lis in verse_tokens for el in lis])))
            masked = ce.mask_word_structure(verse_tokens, char_set)
            tokens = {'orig': verse_tokens, 'shuffled': shuffled, 'masked': masked}
            joined = {k: ce.join_verses(v, insert_spaces=True) for k, v in tokens.items()}
            base_filename = TEMP_FILES_PATH + filename.split('/')[-1] + f'_{book_id}_v{n_pairs}'
            filenames = {k: ce.to_file(v, base_filename, k) for k, v in joined.items()}
            hierarchical_level_mismatches = {hierarchical_level: ce.run_mismatcher(preprocessed_filename, 
                                                                                   REMOVE_MISMATCHER_FILES) \
                                             for hierarchical_level, preprocessed_filename in filenames.items()}
            hierarchical_level_entropy = {hierarchical_level: ce.get_entropy(mismatches) \
                                          for hierarchical_level, mismatches \
                                          in hierarchical_level_mismatches.items()}
            n_pairs_entropies[n_pairs] = hierarchical_level_entropy
        book_id_entropies[book_id] = n_pairs_entropies
    file_entropies[filename] = book_id_entropies

In [None]:
with open(f'{TEMP_FILES_PATH}/entropies.json', 'w') as f:
    # Note: JSON stores the integer keys (book ids, merge counts) as strings
    json.dump(file_entropies, f)
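
One thing to keep in mind when reading this file back for plotting: JSON object keys are always strings, so integer keys such as the merge counts come back as `'0'`, `'1'`, and so on. A minimal sketch of the round trip follows; `restore_int_keys` is a hypothetical helper and the numbers are placeholders, not measured entropies.

```python
import json

def restore_int_keys(n_pairs_entropies):
    # JSON object keys are always strings, so convert merge counts back to int
    return {int(n_pairs): levels for n_pairs, levels in n_pairs_entropies.items()}

# Placeholder numbers for illustration only, not measured entropies
toy_entropies = {0: {'orig': 1.0, 'shuffled': 2.0, 'masked': 3.0}}
loaded = json.loads(json.dumps(toy_entropies))
print(list(loaded.keys()))
restored = restore_int_keys(loaded)
print(list(restored.keys()))
```

Converting the keys back to integers matters mostly for sorting, since as strings `'10'` sorts before `'2'`.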