# Motivation

To submit this work to LChange23, the study should ideally include a time-dependent component, but it is not yet clear how to add one. Here, I explore a possible extension of our work in that direction.

In [None]:
import os
import data
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

In [None]:
BIBLE_DIR = "/home/pablo/Documents/GitHubRepos/paralleltext/bibles/corpus/"

In [None]:
translations = os.listdir(BIBLE_DIR)

In [None]:
metadata = []
for t in translations:
    with open(os.path.join(BIBLE_DIR, t)) as f:
        lines = f.readlines()
    comments, _, _ = data.split_pbc_bible_lines(lines, parse_content=False)
    comments['filename'] = t
    metadata.append(comments)

In [None]:
df = pd.DataFrame(metadata)

In [None]:
def is_int(text: str) -> bool:
    try:
        int(text)
        return True
    except ValueError:
        return False

In [None]:
# TODO: do this programmatically
df.loc[239, 'year_short'] = '1860'
df.loc[838, 'year_short'] = ''
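
The TODO above could be approached with a regular expression. This is only a minimal sketch, assuming the year fields contain a four-digit year somewhere in free text; `extract_year` is a hypothetical helper (it is not part of the `data` module), the example strings are invented, and taking the first match is an arbitrary choice that would still need manual checking.

```python
import re
from typing import Optional

# A plausible year (1000-2099) that is not part of a longer run of digits
YEAR_PATTERN = re.compile(r"(?<!\d)(1\d{3}|20\d{2})(?!\d)")

def extract_year(text: str) -> Optional[int]:
    """Return the first plausible four-digit year in `text`, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None

print(extract_year("1860/1998"))      # 1860
print(extract_year("no date given"))  # None
```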

In [None]:
assert len(df[df['year_short'].apply(lambda x: x.strip() != '' and not is_int(x))]) == 0

In [None]:
# From the ones that don't have year_short, check the year_long
not_parsed = df[df.apply(lambda row: row['year_short'].strip() == '' and row['year_long'].strip() != '', 1)][['language_name', 'year_long']]

In [None]:
for i in range(len(not_parsed)):
    print((not_parsed.index[i], not_parsed.values[i][1]))

In [None]:
index_year_long = {437: '', 460: 2003, 489: '', 864: 1965, 1217: '', 1316: 1975, 1580: '', 1590: '', 1669: 2006, 
                   1984: 2011}
for index, year_long in index_year_long.items():
    df.loc[index, 'year_long'] = str(year_long)

In [None]:
assert len(df[df.apply(lambda row: row['year_short'].strip() == '' and row['year_long'].strip() != '' and not is_int(row['year_long']), 1)]) == 0

In [None]:
def get_year(row: pd.Series) -> int:
    year_short = row['year_short'].strip()
    year_long = row['year_long'].strip()
    # If there is a year_short, parse it
    if year_short != '':
        return int(year_short)
    elif year_long != '':
        return int(year_long)
    else:
        return -1

In [None]:
df['year'] = df.apply(get_year, 1)

In [None]:
print(len(df), len(df[df['year'] == -1]))
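
As an alternative to the `-1` sentinel, the missing years could be stored as real missing values. The sketch below uses a toy frame with invented values and a hypothetical helper `to_year`; it assumes both columns hold strings, as in the cells above.

```python
import pandas as pd

# Toy stand-in for the metadata frame (values are invented)
toy = pd.DataFrame({
    "year_short": ["1860", "", " "],
    "year_long": ["", "2003", ""],
})

def to_year(col: pd.Series) -> pd.Series:
    # Blank or non-numeric entries become <NA> instead of -1
    return pd.to_numeric(col.str.strip(), errors="coerce").astype("Int64")

# Prefer year_short and fall back to year_long where it is missing
toy["year"] = to_year(toy["year_short"]).fillna(to_year(toy["year_long"]))
print(toy)
```

With this representation, `toy["year"].isna()` replaces the comparisons against `-1`.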

In [None]:
assert len(df[df.apply(lambda row: row['year'] == -1 and \
                       row['year_long'].strip() + row['year_short'].strip() != '', 1)]) == 0

In [None]:
# Look at the year distribution of the years we have
df[df['year'] != -1]['year'].hist()

In [None]:
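from typing import Optional

def get_century(year: int) -> Optional[int]:
    # -1 marks an unknown year
    if year == -1:
        return None
    # Century N spans the years 100*(N-1)+1 to 100*N, so 1900 is still the 19th century
    return (int(year) - 1) // 100 + 1

def test_get_century():
    assert get_century(2023) == 21
    assert get_century(1536) == 16
    assert get_century(1900) == 19
    assert get_century(-1) is None

test_get_century()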

In [None]:
df['century'] = df['year'].apply(get_century)

In [None]:
assert len(df[df['year'] == -1]) == len(df[df['century'].isnull()]) and \
len(df[(df['year'] == -1) & (df['century'].notnull())]) == 0 and \
len(df[(df['year'] != -1) & (df['century'].isnull())]) == 0

In [None]:
century_counter = Counter(df[df['century'].notnull()]['century'])

In [None]:
sorted([(k, v) for k, v in century_counter.items()], key=lambda x: x[0])

Here I see two possibilities:

1. Take the translations from before the 20th century. Compile the list of languages that are present in that list and also in the 20th or 21st centuries. Do an analysis of the evolution of those languages. (If doing this option, merge different variants such as Middle English, Ancient English, etc.)

2. Make a list of languages for which there is an old and a new variant (search for Ancient or Middle in the name, e.g.). Do an analysis of the evolution of those languages.

So, either way, we have to start by searching for the languages that we might like to merge. Looking for the words Old, Ancient, and Middle, I found three candidates:

- Middle English (enm) / English (eng)

- Ancient Hebrew (hbo) / Hebrew (hbo)

- Ancient Greek (ell/grc) / Greek (ell/grc)

Unfortunately, I could not find a reliable translation in Ancient Greek, so the diachronic study across language varieties (option 2 above) can only be done for English and Hebrew.
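
The search for candidate names can be sketched as follows. The toy frame only mimics the `language_name` column, and the pattern is an assumption about how the historical variants are labelled.

```python
import pandas as pd

# Toy stand-in for the language_name column
toy = pd.DataFrame({"language_name": ["Middle English", "English", "Ancient Hebrew", "Hebrew"]})

# Whole-word match, so a name that merely starts with "Old..." is not picked up
pattern = r"\b(?:Old|Ancient|Middle)\b"
mask = toy["language_name"].str.contains(pattern, regex=True)

candidates = toy.loc[mask, "language_name"]
# Strip the qualifier to find the modern counterpart to pair with
base_names = candidates.str.replace(pattern, "", regex=True).str.strip()
print(list(zip(candidates, base_names)))
```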

In [None]:
# TODO: send the following to Cysouw for fixing

In [None]:
# Before proceeding, do some ISO code fixing
df.loc[153, 'closest_ISO_639-3'] = 'aym'
assert len(df[df['closest_ISO_639-3'].apply(lambda x: x in ('ayr', 'ayc'))]) == 0

In [None]:
df.loc[922, 'closest_ISO_639-3'] = 'bbc'
assert len(df[(df['language_name'] == 'Batak Toba') & (df['closest_ISO_639-3'] == 'bto')]) == 0

In [None]:
df.loc[836, 'closest_ISO_639-3'] = 'boa'
assert len(df[(df['language_name'] == 'Bora') & (df['closest_ISO_639-3'] == 'bao')]) == 0

In [None]:
df.loc[1572, 'language_name'] = 'Ranglong'
df.loc[1572, 'closest_ISO_639-3'] = 'rnl'
assert len(df[(df['language_name'] == 'E-De') & (df['closest_ISO_639-3'] == 'Ranglong')]) == 0

In [None]:
df.loc[891, 'language_name'] = 'Ewondo'
assert len(df[(df['language_name'] == 'Ewe') & (df['closest_ISO_639-3'] == 'ewo')]) == 0

In [None]:
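# 'hif' is the ISO 639-3 code for Fiji Hindi, so it goes in the ISO column, not the name column
df.loc[1215, 'closest_ISO_639-3'] = 'hif'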
assert len(df[(df['language_name'] == 'Fiji-Hindi') & (df['closest_ISO_639-3'] == 'fij')]) == 0

In [None]:
# These are mutually intelligible
df.loc[1136, 'closest_ISO_639-3'] = 'gub'
assert len(df[df['closest_ISO_639-3'] == 'tqb']) == 0

In [None]:
df.loc[826, 'closest_ISO_639-3'] = 'swh'
assert len(df[df['closest_ISO_639-3'] == 'bcw']) == 0

In [None]:
df.loc[1481, 'closest_ISO_639-3'] = 'bpr'
assert len(df[df['closest_ISO_639-3'] == 'bps']) == 0

In [None]:
for i in (13, 147, 1540):
    df.loc[i, 'closest_ISO_639-3'] = 'msa'
assert len(df[df['closest_ISO_639-3'] == 'zsm']) == 0

In [None]:
df.loc[541, 'closest_ISO_639-3'] = 'mbh'
assert len(df[df['closest_ISO_639-3'] == 'mnh']) == 0

In [None]:
df.loc[1015, 'language_name'] = 'Seim/Mende'
assert len(df[(df['closest_ISO_639-3'] == 'sim') & (df['language_name'] == 'Mende')]) == 0

In [None]:
for i in (576, 1633):
    df.loc[i, 'language_name'] = 'Nynorsk (Norsk)'
assert len(df[(df['closest_ISO_639-3'] == 'nno') & (df['language_name'] == 'Norsk')]) == 0

In [None]:
df.loc[598, 'closest_ISO_639-3'] = 'tsz'
assert len(df[df['closest_ISO_639-3'] == 'pua']) == 0

In [None]:
for i in df[df['closest_ISO_639-3'].apply(lambda x: x in ('als', 'aln'))].index:
    df.loc[i, 'closest_ISO_639-3'] = 'sqi'
assert len(df[df['closest_ISO_639-3'].apply(lambda x: x in ('als', 'aln'))]) == 0

In [None]:
for i in (968, 997):
    df.loc[i, 'closest_ISO_639-3'] = 'nep'
assert len(df[df['closest_ISO_639-3'] == 'npi']) == 0
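
Several of the fixes above replace a code wherever it occurs rather than in one specific row. Those could be collected in a single mapping, as in this sketch; the mapping `ISO_MERGES` is a hypothetical name, its entries are taken from the cells above, and the toy rows are invented.

```python
import pandas as pd

# Code-level merges taken from the cells above
ISO_MERGES = {"als": "sqi", "aln": "sqi", "npi": "nep", "zsm": "msa"}

# Toy stand-in for the metadata frame
toy = pd.DataFrame({"closest_ISO_639-3": ["als", "aln", "npi", "eng"]})

# Replace every occurrence of an old code in one pass
toy["closest_ISO_639-3"] = toy["closest_ISO_639-3"].replace(ISO_MERGES)

# None of the old codes should survive
assert not toy["closest_ISO_639-3"].isin(list(ISO_MERGES)).any()
print(toy["closest_ISO_639-3"].tolist())
```

Row-specific fixes (where only one file is mislabelled) would still need to be applied by index.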

In [None]:
assert not any([grp['closest_ISO_639-3'].nunique() > 1 and lbl.strip() != '' and lbl != 'Greek' and \
                lbl != 'ελληνικά' and lbl != '文言（中文）' \
                for lbl, grp in df.groupby('language_name')])
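
When the assertion above fails, it does not say which name is inconsistent. A small sketch for listing the offending names together with their codes, using an invented toy frame:

```python
import pandas as pd

# Toy stand-in: one name deliberately maps to two codes
toy = pd.DataFrame({
    "language_name": ["Norsk", "Norsk", "Ewe"],
    "closest_ISO_639-3": ["nob", "nno", "ewe"],
})

# Collect the distinct codes seen for each language name
codes_per_name = toy.groupby("language_name")["closest_ISO_639-3"].unique()

# Keep only the names that are attached to more than one code
conflicts = {name: sorted(codes) for name, codes in codes_per_name.items() if len(codes) > 1}
print(conflicts)
```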

In [None]:
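# This one can be seen in the file name
df.loc[607, 'year'] = 1894
# Keep the derived century column in sync with the corrected year
df.loc[607, 'century'] = get_century(1894)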

## 1. Variations across years

In this case, we will take the translations from before the 20th century. Then we will merge different variants under a single ISO code (Middle English with English, Ancient Greek with Greek, Classical Chinese with Chinese; Ancient Hebrew and Hebrew already share the code hbo). Finally, we will check for which languages we have translations both before the 20th century and in the 20th or 21st centuries, and we will do a diachronic study of those.

In [None]:
# Make a copy of the dataframe
df1 = df.reset_index()

# Merge variants of Greek
df1['closest_ISO_639-3'] = df1['closest_ISO_639-3'].apply(lambda x: 'ell' if x.strip() == 'grc' else x)

In [None]:
assert len(df1[df1['closest_ISO_639-3'] == 'grc']) == 0

In [None]:
# TODO: point out to Cysouw that there are inconsistencies with the Greek and Chinese

In [None]:
df1['closest_ISO_639-3'] = df1['closest_ISO_639-3'].apply(lambda x: 'zho' if x.strip() == 'lzh' else x)

In [None]:
assert len(df1[df1['closest_ISO_639-3'] == 'lzh']) == 0

In [None]:
df1['closest_ISO_639-3'] = df1['closest_ISO_639-3'].apply(lambda x: 'eng' if x.strip() == 'enm' else x)
assert len(df1[df1['closest_ISO_639-3'] == 'enm']) == 0

In [None]:
old_languages = df1[(df1['century'].notnull()) & (df1['century'] < 20)]['closest_ISO_639-3'].unique()

In [None]:
new_languages = df1[(df1['century'].notnull()) & (df1['century'] >= 20)]['closest_ISO_639-3'].unique()

In [None]:
diachronic_languages = [el for el in new_languages if el in old_languages]
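
The same selection can be written as a set intersection, which avoids scanning `old_languages` once per element. The lists below are invented placeholders for the two arrays.

```python
# Toy stand-ins for the arrays of ISO codes
old_languages = ["eng", "deu", "ell", "lat"]
new_languages = ["eng", "deu", "zho", "ell"]

# Languages attested both before and from the 20th century onwards
diachronic_languages = sorted(set(old_languages) & set(new_languages))
print(diachronic_languages)  # ['deu', 'ell', 'eng']
```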

In [None]:
diachronic_df = df1[(df1['closest_ISO_639-3'].apply(lambda x: x in diachronic_languages)) & (df1['year'] != -1)].sort_values(by=['closest_ISO_639-3', 'year']).reset_index(drop=True)

## 2. Variations across ages

English, Greek, Hebrew, Chinese

- English: enm vs eng (ISO)

- Hebrew: Ancient Hebrew vs Hebrew (name)

- Chinese: lzh vs zho (ISO)

- Greek: grc vs ell (ISO) -> but beware of inconsistencies!

In [None]:
df2 = df.reset_index()

In [None]:
df2.loc[1530, 'closest_ISO_639-3'] = 'lzh'

In [None]:
variant_df = df2[df2.apply(lambda row: row['closest_ISO_639-3'] in ('enm', 'eng', 'lzh', 'zho', 'grc', 'ell', 'hbo'), 1)].reset_index(drop=True)

In [None]:
variant_df.sort_values(by=['closest_ISO_639-3', 'year'], inplace=True)

## Entropy calculations

Now we have to decide how the study will be set up. Ideally we'd like to get, for each language, a single value for each year for each quantity. Then, we can create a plot like the ones from the paper, for each language. What we already have are calculations for specific books. We can combine these quantities somehow, or we can recompute the entropy for the concatenated books.

Following Koplenig et al., it makes more sense to create a different plot for each book, and to use approach 1 (variations across years), since it gives us more datapoints for the same language.

In [None]:
ENTROPIES_DIR = '/home/pablo/Documents/GitHubRepos/WordOrderBibles/output/KoplenigEtAl/WordPasting/HPC/'

In [None]:
entropies_files = [el for el in os.listdir(ENTROPIES_DIR) if el.endswith('.csv') and el.startswith('entrop')]

In [None]:
entropies_dfs = []
for f in entropies_files:
    entropies_df = pd.read_csv(os.path.join(ENTROPIES_DIR, f))
    try:
        entropies_df = entropies_df[entropies_df['iter_id'] == 0].reset_index(drop=True)
    except KeyError:
        print(f)
        break
    entropies_df['filename'] = f.replace('entropies_', '').replace('.csv', '')
    entropies_dfs.append(entropies_df)
entropies_merged = pd.concat(entropies_dfs)

In [None]:
entropies_merged.sample(5)

Now we have to merge this dataframe with diachronic_df. It is a subset of df1, which should have unique entries for filename.

In [None]:
assert len(df1) == df1['filename'].nunique()

In [None]:
df1_entropies = diachronic_df.merge(entropies_merged, on='filename', how='left', validate='1:m')
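
Because this is a left merge, translations without a matching entropy file are kept with missing values. A sketch of how to list them with the `indicator` option of `merge`; the frames and values are invented, and whether the real file names match exactly on both sides is something this check would reveal rather than something assumed here.

```python
import pandas as pd

# Toy stand-ins for the metadata and the per-book entropies
meta = pd.DataFrame({"filename": ["a.txt", "b.txt"], "year": [1611, 1999]})
entropies = pd.DataFrame({
    "filename": ["a.txt", "a.txt"],
    "book": ["Matthew", "Mark"],
    "D_order": [0.9, 0.8],
})

# indicator=True adds a _merge column telling where each row came from
merged = meta.merge(entropies, on="filename", how="left", validate="1:m", indicator=True)

# Rows marked left_only had no entropy file
missing = merged.loc[merged["_merge"] == "left_only", "filename"].tolist()
print(missing)
```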

Pick English (eng) and create the 6 plots

In [None]:
def plot(full_df: pd.DataFrame, iso_code: str) -> None:
    full_df = full_df[full_df['book'].notnull()].reset_index(drop=True)
    data = full_df[full_df['closest_ISO_639-3'] == iso_code].reset_index(drop=True)
    unique_books = data['book'].unique()
    for book_name in unique_books:
        book_data = data[data['book'] == book_name].reset_index(drop=True)
        x = book_data['D_order'].tolist()
        y = book_data['D_structure'].tolist()
        labels = book_data['year'].tolist()
        fig, ax = plt.subplots()
        ax.scatter(x, y)
        plt.xlabel('Word order information')
        plt.ylabel('Word structure information')
        plt.title(f'{book_name}')
        for i, txt in enumerate(labels):
            ax.annotate(txt, (x[i], y[i]), rotation=45)
        plt.show()

In [None]:
plot(df1_entropies, 'eng')

This looks kind of confusing. Let's average over books to get a single number for a single file.

In [None]:
file_entropy = entropies_merged[['D_structure', 'D_order', 'filename']].groupby('filename').mean().reset_index()

In [None]:
fle1 = diachronic_df.merge(file_entropy, on='filename', how='left', validate='1:1')

In [None]:
def plot_mean(full_df: pd.DataFrame, iso_code: str) -> None:
    data = full_df[full_df['closest_ISO_639-3'] == iso_code].reset_index(drop=True)
    x = data['D_order'].tolist()
    y = data['D_structure'].tolist()
    labels = data['year'].tolist()
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    plt.xlabel('Word order information')
    plt.ylabel('Word structure information')
    for i, txt in enumerate(labels):
        ax.annotate(txt, (x[i], y[i]), rotation=45)
    plt.show()

In [None]:
plot_mean(fle1, 'eng')

We still have multiple results for the same year. We could take the average by century.

In [None]:
def plot_century(full_df: pd.DataFrame, iso_code: str) -> None:
    data = full_df[full_df['closest_ISO_639-3'] == iso_code].reset_index(drop=True)
    data = data[['D_order', 'D_structure', 'century']].groupby('century').mean().reset_index()
    x = data['D_order'].tolist()
    y = data['D_structure'].tolist()
    labels = [int(el) for el in data['century'].tolist()]
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    plt.xlabel('Word order information')
    plt.ylabel('Word structure information')
    for i, txt in enumerate(labels):
        ax.annotate(txt, (x[i], y[i]), rotation=45)
    plt.show()

In [None]:
plot_century(fle1, 'eng')

In [None]:
plot_century(fle1, 'deu')

It's hard to make sense of this data. The points for the 20th and 21st centuries could be errors, because some versions of the Bible were released in those centuries but contain older text. But the remaining points don't show a clear pattern either.

Compared to the variation in these quantities observed in the plot in the paper, these variations are rather small. This seems to indicate that a time variation in these quantities cannot be observed, at least not with this methodology.
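
To put a number on "rather small", one option is to compute the range of each quantity within every language and compare it with the spread between languages. The sketch below shows only the within-language part; all values are invented and do not come from the entropy files.

```python
import pandas as pd

# Invented values, one row per (language, century)
toy = pd.DataFrame({
    "closest_ISO_639-3": ["eng", "eng", "eng", "deu", "deu"],
    "D_order": [0.80, 0.82, 0.81, 0.70, 0.73],
    "D_structure": [0.40, 0.41, 0.39, 0.55, 0.52],
})

grouped = toy.groupby("closest_ISO_639-3")[["D_order", "D_structure"]]

# Range (max - min) of each quantity within a language
spread = grouped.max() - grouped.min()
print(spread)
```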

### Option 2

Analysis number 1 was chosen because it was more similar to Koplenig et al. But what happens in the case of analysis 2?

In [None]:
variant_df = variant_df[['language_name', 'closest_ISO_639-3', 'year_long', 'year_short', 'year', 'filename']]

In [None]:
fle2 = variant_df.merge(file_entropy, on='filename', how='left', validate='1:1')

In [None]:
def map_iso(row: pd.Series) -> str:
    if row['closest_ISO_639-3'] != 'hbo':
        return row['closest_ISO_639-3']
    if row['language_name'] == 'Hebrew':
        return 'hbo-new'
    return 'hbo'

In [None]:
fle2['closest_ISO_639-3'] = fle2.apply(map_iso, 1)

In [None]:
# Now we need to get a single value per ISO
iso_entropy = fle2[['closest_ISO_639-3', 'D_structure', 'D_order']].groupby('closest_ISO_639-3').mean().reset_index()

In [None]:
iso_entropy

The results are:

- Hebrew -> invalid

- English -> too close to say anything

- Greek -> interesting; seemingly more structure and less order in ancient Greek

- Chinese -> seemingly much more structure and much less order in classical Chinese

With this known, it would be interesting to look at the analysis-1 results for Greek and Chinese

In [None]:
plot_century(fle1, 'ell')

So the results for Greek don't make sense, and the results for Chinese are absent because the earliest bible is from the 20th century.

In conclusion, the labelling of years seems to be unreliable, or this methodology can't pick out differences as well as we would like.

In [None]:
fle1[(fle1['closest_ISO_639-3'] == 'ell') & (fle1['century'].notnull()) & (fle1['D_order'].notnull())][['filename', 'century', 'D_order', 'D_structure']].sort_values(by='century')

So the conclusion is that some variation can be observed for Greek. This is in line with my expectation that the contemporary language makes less use of cases.
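
A simple way to check whether such values drift with time, rather than just scatter, is a rank correlation between year and the quantity of interest. In this sketch the numbers are invented, and the Spearman coefficient is computed as the Pearson correlation of the ranks so that only pandas is needed.

```python
import pandas as pd

# Invented values standing in for one language's translations
toy = pd.DataFrame({
    "year": [1500, 1700, 1900, 2000],
    "D_order": [0.90, 0.80, 0.85, 0.70],
})

# Spearman's rho equals the Pearson correlation of the ranked values
rho = toy["year"].rank().corr(toy["D_order"].rank())
print(round(rho, 3))  # -0.8
```

With so few points per language, such a coefficient would only be indicative.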