Here we try to reproduce Fig. 1 of Bentz et al. To avoid running the code for every language, we first focus on a single bible and compare the entropies obtained with the methods of Bentz et al. against those obtained with the methods of Montemurro and Zanette.

In [None]:
import pandas as pd
import data
from analysis import full_entropy_calculation_bpw
import analysis
import numpy as np
#from compression import shortest_unseen_subsequence_lengths
import compression_entropy as ce

In [None]:
mz_entropies = pd.read_csv('output/MontemurroZanette/eng-x-bible-world_entropies.csv')

Now let's open that bible and check that we get exactly the same values.

In [None]:
# Variables related to the location of the data and the type of system
bibles_path = '/home/pablo/Documents/GitHubRepos/paralleltext/bibles/corpus/'
bible_filename = 'eng-x-bible-world.txt'
output_path = 'output/BentzEtAl/'
# Variables related to the processing of text for GPT-2
prompt = ''
separator = ' '
# Variables related to the processing of text for unigram entropies
remove_punctuation = False
lowercase = False

bible = data.parse_pbc_bible(bibles_path + bible_filename)

"""For each of these hierarchical orders, we can compute the entropy per word and the unigram entropy."""
by_bible, _, by_book, _, _ = bible.join_by_toc()
by_level = {'bible': by_bible, 'book': by_book}

eos_token = ''
level_text = {level_name: data.join_texts_in_dict(id_texts, prompt, eos_token, separator) \
              for level_name, id_texts in by_level.items()}

raw_name = output_path + bible_filename
level_entropies = {level_name: full_entropy_calculation_bpw(id_text,
                                                        remove_punctuation,
                                                        lowercase,
                                                        f'{raw_name}_{level_name}') \
                   for level_name, id_text in level_text.items()}

level_avg_text_len = {level_name: np.mean([len(data.tokenize(text, remove_punctuation, lowercase)) \
                                           for text in id_text.values()]) \
                      for level_name, id_text in level_text.items()}

# Save all these values to a Pandas dataframe that we can use to make histograms and compute statistics
df = pd.DataFrame(columns=('level', 'n_tokens', 'H', 'H_s', 'H_r', 'id'))
for level_name, section_entropies in level_entropies.items():
    for section_id, entropies in section_entropies.items():
        row = (level_name, len(data.tokenize(level_text[level_name][section_id], remove_punctuation, lowercase)),
               entropies[0], entropies[1], entropies[2], str(section_id))
        df.loc[len(df)] = row

# Compute the word-order entropies
df['D_r'] = df['H_r'] - df['H']
df['D_s'] = df['H_s'] - df['H']

In [None]:
df[df['level'] == 'bible']

In [None]:
mz_entropies[mz_entropies['level'] == 'bible']
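
The side-by-side inspection of the two tables can also be done programmatically. The following is a minimal sketch, not part of the original analysis: `entropies_match` is a hypothetical helper, and it assumes that both tables carry `level` and `id` keys plus `H`, `H_s` and `H_r` columns (true for `df` above, an assumption for the CSV).

```python
import numpy as np
import pandas as pd


def entropies_match(left, right, keys=('level', 'id'),
                    columns=('H', 'H_s', 'H_r'), rtol=1e-9):
    """Return True if both tables hold the same rows and matching entropies."""
    left, right = left.copy(), right.copy()
    for frame in (left, right):
        for key in keys:
            # CSV round-trips can turn ids into integers, so compare as strings
            frame[key] = frame[key].astype(str)
    merged = left.merge(right, on=list(keys), suffixes=('_a', '_b'))
    if not (len(merged) == len(left) == len(right)):
        return False  # some rows have no counterpart in the other table
    return all(np.allclose(merged[f'{col}_a'], merged[f'{col}_b'], rtol=rtol)
               for col in columns)
```

Under those assumptions, `entropies_match(df, mz_entropies)` would return `True` only when every section agrees to within the tolerance.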

These look exactly the same. To do:

1. Re-do the calculation of H using my "dumb" implementation of the entropy calculation (i.e., without using the mismatcher).

2. Recalculate H using https://github.com/dimalik/Hrate/

3. Recalculate H_r using https://gist.github.com/shhong/1021654/

Are any of these results significantly different from those obtained above? If so, understand why.

The next open question is: are there significant differences in the methodology of Bentz et al. that would make my results differ from theirs?

## 1: Re-do the calculation of H using my dumb implementation of the entropy calculation

This will take a very long time, so we will have to do it on a single book, not on the whole bible.
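
To show why a brute-force computation is so slow, here is a minimal sketch of a match-length entropy estimator of the Kontoyiannis type. It is an illustration under stated assumptions, not the code in `compression` or `compression_entropy`: the function names are hypothetical, and the normalization (the inverse of the mean of `l_i / log2(i)` over positions `i = 2..N`) is one common form that may differ in detail from `ce.get_entropy`.

```python
import math


def match_lengths(tokens):
    """For each position after the first, return 1 + the length of the longest
    run starting there that also starts at some earlier position."""
    n = len(tokens)
    lengths = []
    for i in range(1, n):
        longest = 0
        for j in range(i):  # every earlier starting point
            k = 0
            while i + k < n and tokens[j + k] == tokens[i + k]:
                k += 1
            longest = max(longest, k)
        lengths.append(longest + 1)  # shortest subsequence not seen before
    return lengths


def match_length_entropy(tokens):
    """Entropy rate estimate in bits per token (assumed normalization)."""
    lengths = match_lengths(tokens)
    if not lengths:
        raise ValueError('need at least two tokens')
    # Positions are numbered from 1, so the first usable position is i = 2
    total = sum(l / math.log2(i) for i, l in enumerate(lengths, start=2))
    return len(lengths) / total
```

The nested loops compare every position with every earlier one, so the cost grows at least quadratically with the number of tokens, which is why a whole bible is out of reach with this approach.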

In [None]:
BOOK_ID = 41

In [None]:
mz_entropies[(mz_entropies['level'] == 'book') & (mz_entropies['id'] == str(BOOK_ID))]

In [None]:
df[(df['level'] == 'book') & (df['id'] == str(BOOK_ID))]

In [None]:
REMOVE_PUNCTUATION = False
LOWERCASE = False

In [None]:
# Tokenize for the unigram entropy computations
tokens = data.tokenize(level_text['book'][BOOK_ID], REMOVE_PUNCTUATION, LOWERCASE)

In [None]:
# The following was run once and now we read the file
"""
mismatches = shortest_unseen_subsequence_lengths(tokens)
with open(f'output/BentzEtAl/book_{BOOK_ID}_mismatches.txt', 'w') as f:
    for m in mismatches:
        f.write(str(m) + '\n')
"""

In [None]:
with open(f'output/BentzEtAl/book_{BOOK_ID}_mismatches.txt', 'r') as f:
    lines = f.readlines()
mismatches = [int(el) for el in lines]

### Run the code that was used to get H above, and make sure we get the same value

In [None]:
bibles_path = '/home/pablo/Documents/GitHubRepos/paralleltext/bibles/corpus/'
# Variables related to the processing of text for GPT-2
prompt = ''
separator = ' '
# Variables related to the processing of text for unigram entropies
remove_punctuation = False
lowercase = False

bible = data.parse_pbc_bible(bibles_path + bible_filename)

"""For each of these hierarchical orders, we can compute the entropy per word and the unigram entropy."""
_, _, by_book, _, _ = bible.join_by_toc()
by_level = {'book': by_book}

eos_token = ''
level_text = {level_name: data.join_texts_in_dict(id_texts, prompt, eos_token, separator) \
              for level_name, id_texts in by_level.items()}

raw_name = output_path + bible_filename
level_entropies = {level_name: full_entropy_calculation_bpw(id_text,
                                                            remove_punctuation,
                                                            lowercase,
                                                            f'{raw_name}_{level_name}') \
                   for level_name, id_text in level_text.items()}

In [None]:
print(f"{level_entropies['book'][BOOK_ID][0]:.6f}")

### Use my mismatches to get the same result

In [None]:
dumb_H = ce.get_entropy(mismatches)

In [None]:
def percent_diff(a, b) -> str:
    # Symmetric percent difference: |a - b| relative to the mean of a and b
    return f'{abs(a - b) / ((a + b) / 2) * 100:.4f}'

In [None]:
print(percent_diff(level_entropies['book'][BOOK_ID][0], dumb_H), '% difference between methods')

### 1: Conclusion

There is no significant difference between the mismatcher and my method.

## 2: Recalculate H using https://github.com/dimalik/Hrate/

This is the method used by Bentz et al. If it gives a result that is significantly different from mine, that might explain my discrepancy with Montemurro & Zanette and/or with Bentz et al.

This is an R package, so I'm doing this in an R terminal. It's quite slow.

### 2: Conclusion

The estimate I got from the R package was 5.51866, which is very different from the estimate I obtained above with my own code. I don't understand the discrepancy.

## 3: Recalculate H_r using https://gist.github.com/shhong/1021654/

In [None]:
import sys
sys.path.append("/home/pablo/Documents/GitHubRepos/")

In [None]:
import nsb_entropy as ne

In [None]:
from collections import Counter

In [None]:
sample = """Accordingly, our approach to empirically approximating the amount of redundancy at a
specific text position i is based on the following idea: In order to determine the redundancy at
position i, we examine the whole portion of the text up to (but not including) i and monitor
how many of the initial characters of the text portion starting at i have already occurred in the
same order somewhere in the preceding text, and record the length of longest continuous sub-
string. Our key quantity of interest l i is obtained by adding 1 to the longest match-length. As
an example, imagine that we read the King James version of the Bible (here the Gospel of Mat-
thew); let us assume that we have already read the first 127,348 characters of the text (again
including spaces). Around the end of this text portion, the text reads “they perceived that he
spake of them”, where the letter e in boldface, i.e. the 13 th letter position of the sentence, is the
final character read so far. At this position, we can go through the previous 127,347 characters
and will find out that the longest contiguous subsequence starting at i and being a repetition of
a sequence starting before this position can be found at position 125,150 (in boldface): “they
supposed that they should have . . .”. Thus, at position i, the resulting sequence that approxi-
mates redundancy is “ed that”. Including spaces, that sequence is 8 characters long, so l i = 9.
Interestingly, [11] showed that l i grows like (log i)/H where H is the entropy of the underlying
process. Since H can be thought of as the “ultimate compression” of the string [12], H can be
seen as a useful index of the amount of redundancy contained in the string (for convergence
issues, cf. the Materials and methods section). However, as [13] demonstrate, l i is highly
dependent on the choice of i, e.g. it both fluctuates to a considerable extent and naturally
depends on the amount of text that we have already read up to position i. To solve these prob-
lems, [13] simply suggest calculating l i at each position i of the whole string with a length of N
characters. The resulting estimates of redundancy at each position in the text are then aver-
aged, which leads to the following estimator of the entropy of the string"""

In [None]:
tokens = sample.split()

c = Counter(tokens)
input_histogram = np.array(list(c.values()))
nsb_entropy = ne.S(ne.make_nxkx(input_histogram, len(c.keys())), input_histogram.sum(), len(c.keys()))
print(f'NSB: {float(nsb_entropy):.4f}')

print(f'Mine: {analysis.unigram_entropy_direct(tokens):.4f}')

In [None]:
import os
files = os.listdir('output/BentzEtAl')
entropy_files = [el for el in files if el.endswith('_entropies.csv')]

In [None]:
BENTZ_ET_AL_FILES = 'output/BentzEtAl/'
LANGUAGE_MAP = {'deu': 'German', 'vie': 'Vietnamese', 'eng': 'English', 'mya': 'Burmese', 
                'esk': 'Inupiatun', 'zho': 'Chinese', 'grc': 'Greek', 'tam': 'Tamil', 
                'zul': 'Zulu', 'qvw': 'Quechua', 'chr': 'Cherokee', 'xuo': 'Kuo'}

In [None]:
dataframes = [(filename, pd.read_csv(BENTZ_ET_AL_FILES + filename)) for filename in entropy_files]

In [None]:
for i in range(len(dataframes)):
    dataframes[i][1]['filename'] = dataframes[i][0]

In [None]:
dataframes = [el[1] for el in dataframes]

In [None]:
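# Use a dedicated loop variable so the `df` built earlier is not overwritten
for entropy_df in dataframes:
    entropy_df['iso'] = entropy_df['filename'].apply(lambda x: x.split('-')[0])
    entropy_df['bible_id'] = entropy_df['filename'].apply(lambda x: x.replace('_entropies.csv', '')[6:])

In [None]:
for entropy_df in dataframes:
    entropy_df.drop(columns=['filename'], inplace=True)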

In [None]:
assert all([len(el) == 1 for el in dataframes])

In [None]:
H_unigram = [el['H_unigram'].tolist()[0] for el in dataframes]

In [None]:
print(f'Unigram entropy: {np.mean(H_unigram):.2f} +/- {np.std(H_unigram):.2f}')

This value is radically lower than the ones obtained by Montemurro & Zanette, by Bentz et al., and by me.
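
The loading and labelling steps in the cells above could be condensed into a single function. This is only a sketch: `load_entropy_tables` is a hypothetical name, and it assumes the same file naming scheme (`<iso>-x-<bible_id>_entropies.csv`) that the slicing above relies on.

```python
from pathlib import Path

import pandas as pd


def load_entropy_tables(folder, suffix='_entropies.csv'):
    """Read every entropy CSV in `folder` into one labelled table."""
    frames = []
    for path in sorted(Path(folder).glob(f'*{suffix}')):
        frame = pd.read_csv(path)
        stem = path.name[:-len(suffix)]
        frame['iso'] = stem.split('-')[0]  # e.g. 'eng'
        frame['bible_id'] = stem[6:]       # drops the leading '<iso>-x-'
        frames.append(frame)
    # pd.concat raises ValueError if no file matched
    return pd.concat(frames, ignore_index=True)
```

With the combined table, `table['H_unigram'].mean()` and `table['H_unigram'].std(ddof=0)` give the same statistics as `np.mean` and `np.std`; note that pandas defaults to `ddof=1`, so the argument matters.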

### 3: Conclusion

The NSB entropy is lower than the directly computed unigram entropy (`analysis.unigram_entropy_direct`). This might be the origin of the difference with Bentz et al., though it does not explain the difference with Montemurro & Zanette.
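
For reference, a direct unigram estimate can be sketched as a plug-in (maximum-likelihood) entropy over token frequencies. This assumes that `analysis.unigram_entropy_direct` does something equivalent, which is not confirmed here; `plugin_entropy` and its `base` parameter are hypothetical names used for illustration.

```python
import math
from collections import Counter


def plugin_entropy(tokens, base=2.0):
    """Plug-in entropy of the token distribution (bits when base=2)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Relative frequencies stand in for the true probabilities
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())
```

The `base` argument makes the units explicit: `base=2` gives bits and `base=math.e` gives nats. When comparing against another estimator, it is worth confirming that both report in the same units; the units returned by `ne.S` are not checked in this notebook.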