# Initial Exploration of Biblical Texts

Name: Isaac Anderson

Date: Sept 3 2025

This problem set must accomplish the following tasks:
1. Read in 'SF_2009-01-20_GRC_TISCHENDORF_(TISCHENDORF GREEK NT(STRONGS)).xml' and save data to a dataframe (one word per row).
2. Parse the rmac codes.
3. Read in 'strongs-dictionary.xhtml' and save data to a dataframe (one term per row).
4. Compute the top 50 lemmas by frequency.

**Zipf's law**: in natural language, the frequency of a word is inversely proportional to its rank; e.g., the second most frequent word occurs about half as often as the most frequent word. As a result, the top few words cover a large fraction of the text.
If you have $n$ total tokens and sorted counts $f_1 \ge f_2 \ge \ldots \ge f_V$ (one count per distinct word, for a vocabulary of $V$ words), the coverage of the first $k$ terms is $C_k = \frac{\sum_{i=1}^k f_i}{n}$.
The ideal Zipf prediction is $p(k) \propto 1/k$. For a finite vocabulary of size $V$, the ideal Zipf prediction is $p(k) = \frac{1/k}{\sum_{i=1}^V 1/i}$.
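
A minimal sketch of the two quantities defined above, run on a made-up token list rather than the real lemma column. The function names `coverage` and `zipf_prediction` and the toy tokens are illustrative assumptions, not part of the assignment.

```python
from collections import Counter
from itertools import accumulate


def coverage(tokens, k):
    """Cumulative coverage C_1..C_k of the k most frequent items."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    n = len(tokens)  # total number of tokens
    return [c / n for c in accumulate(counts[:k])]


def zipf_prediction(vocab_size, k):
    """Ideal Zipf probabilities p(1)..p(k) for a finite vocabulary."""
    norm = sum(1 / i for i in range(1, vocab_size + 1))
    return [(1 / rank) / norm for rank in range(1, k + 1)]


# Toy data (hypothetical): 6 tokens, 3 distinct types.
tokens = ["ho", "ho", "ho", "kai", "kai", "theos"]
observed = coverage(tokens, k=3)

# Accumulate the ideal probabilities so they are comparable to coverage.
ideal = list(accumulate(zipf_prediction(vocab_size=3, k=3)))

print(observed)
print(ideal)
```

Plotting `observed` and `ideal` against rank on the same axes gives the comparison the later tasks ask for.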

6. Plot the coverage of the top 20 lemmas and list them in a table along with their Strong's definitions. On this same plot, plot the ideal Zipf prediction for a finite vocabulary size.
7. Identify a way to drop out the content-less words and then plot the coverage of these new top 20 lemmas and list them in a table along with their Strong's definitions. On this same plot, plot the ideal Zipf prediction for a finite vocabulary size.

8. Pickle your dataframes.

My WORKINGS.

In [184]:
# Dependencies
import pandas as pd
from bs4 import BeautifulSoup # requires lxml

In [None]:
# Read every <gr> word element into one row per word.
tisch_greek = pd.read_xml(
    "../Data/SF_2009-01-20_GRC_TISCHENDORF_(TISCHENDORF GREEK NT(STRONGS)).xml",
    parser="lxml",
    xpath="/XMLBIBLE//BIBLEBOOK//CHAPTER//VERS/gr",
)
tisch_greek.tail()

Unnamed: 0,str,rmac,gr,STYLE
137497,3588.0,t-gsm,τοῦ,
137498,2962.0,n-gsm,κυρίου,
137499,2424.0,n-gsm,Ἰησοῦ,
137500,3326.0,prep,μετὰ,
137501,3956.0,a-gpm,πάντων.,


## Problem 2: Parsing RMACS.

#### Filtering the RMACS.

In [193]:
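# Filtering rmacs: normalise to lowercase, then drop incomplete codes that end in '-'.
tisch_greek['rmac'] = tisch_greek['rmac'].str.lower()
# na=False keeps rows with a missing rmac instead of failing on NaN.
tisch_greek = tisch_greek[~tisch_greek['rmac'].str.endswith('-', na=False)]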

#### Scraping RMAC data.

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

index_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC/RMACindex.html"
rmac_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC/"

# Get the metadata for all possible rmacs.
response = requests.get(index_url)
soup = BeautifulSoup(response.text, 'html5lib')  # requires html5lib

# The RMAC links are collected from the index page's <blockquote>.
all_links = soup.find('blockquote').find_all('a')

contents = []  # link text, i.e. the RMAC code itself
hrefs = []     # href of each code's detail page
for link in all_links:
    hrefs.append(link.get('href'))
    contents.append(link.text.strip())


#### RMAC processing function.

In [None]:
all_labels = []  # used for making rmac database columns.
def decode_rmac(rmac_link: str) -> list:
    """Fetch one RMAC detail page and return its fields as a list of {label: meaning} dicts."""
    print(rmac_url + rmac_link)
    response = requests.get(rmac_url + rmac_link)
    rmac_soup = BeautifulSoup(response.text, 'html5lib')
    rows = rmac_soup.find_all('tr')
    first_entry = rows[2]  # the table row that is parsed for this code

    # Keep only lines that look like "Label: value".
    lines = first_entry.text.split("\n")
    meaningful_lines = []
    for line in lines:
        cleaned_item = line.strip()
        if re.search(r"\w+: ", cleaned_item):
            meaningful_lines.append(cleaned_item)

    print(meaningful_lines)

    rmac_decoded = []
    for line in meaningful_lines:
        # Note: .group() keeps the trailing colon and only the last word before it (e.g. "Speech:").
        label = re.search(r"(\w+):", line).group()
        # The meaning is the first word after ": ".
        meaning = re.search(r"(?<=: )\w+", line).group()
        rmac_decoded.append({label: meaning})

        all_labels.append(label)  # This is for creating columns down the line.
    return rmac_decoded
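
Because `decode_rmac` keeps only the last word before the colon, a label such as "Part of Speech" ends up as `Speech:`. The sketch below is one possible alternative, not the method used in this notebook: it captures the full label without the colon. The names `FIELD_RE` and `parse_rmac_lines` are hypothetical, and the sample lines are copied from the scraped output for A-APF.

```python
import re

# Label is everything before the first colon; meaning is the first word after it.
FIELD_RE = re.compile(r"^\s*([^:]+):\s*(\w+)")


def parse_rmac_lines(lines):
    """Turn lines such as 'Number: Plural.' into a {label: meaning} dict."""
    fields = {}
    for line in lines:
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2)
    return fields


sample = [
    "Part of Speech: Adjective.",
    "Case: Accusative (direct object; motion toward).",
    "Number: Plural.",
    "Gender: Feminine",
]
parsed = parse_rmac_lines(sample)
print(parsed)
# {'Part of Speech': 'Adjective', 'Case': 'Accusative', 'Number': 'Plural', 'Gender': 'Feminine'}
```

Returning a single dict per code also means each row can be assigned in one step instead of unpacking a list of one-item dicts.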

In [223]:
rmacs_df = pd.DataFrame()
rmacs_df['rmac'] = contents
rmacs_df.head()

Unnamed: 0,rmac
0,A-APF
1,A-APF-C
2,A-APF-S
3,A-APM
4,A-APM-C


In [238]:
rmac_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC/"
new = []
for index, row in rmacs_df.iterrows():
    # Each code's detail page is named after the code, e.g. "A-APF.htm".
    decoded_rmac_parts = decode_rmac(row['rmac']+".htm")

    # Write each decoded field into its own column for this row.
    for dict_segment in decoded_rmac_parts:
        for label, meaning in dict_segment.items():
            rmacs_df.at[index, label] = meaning
    


https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APF.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Feminine']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APF-C.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Feminine.', 'Degree: Comparative.']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APF-S.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward)', 'Degree: Superlative.']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APM.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Masculine']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APM-C.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Masculine.', 'Degree: Comparative.']
https://www.modernliteralversion.

In [239]:
rmacs_df.head()

Unnamed: 0,rmac,Speech:,Case:,Number:,Gender:,Degree:,Form:,Tense:,Voice:,Mood:,Person:
0,A-APF,Adjective,Accusative,Plural,Feminine,,,,,,
1,A-APF-C,Adjective,Accusative,Plural,Feminine,Comparative,,,,,
2,A-APF-S,Adjective,Accusative,,,Superlative,,,,,
3,A-APM,Adjective,Accusative,Plural,Masculine,,,,,,
4,A-APM-C,Adjective,Accusative,Plural,Masculine,Comparative,,,,,
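
One thing to watch when attaching this table to the word-level dataframe: the word codes were lowercased during filtering (`t-gsm`), while the scraped codes are uppercase (`A-APF`). A minimal sketch of a join that normalises the key first, using toy stand-in frames; the frame names `words` and `codes` and the decoded values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical rows).
words = pd.DataFrame({
    "gr": ["τοῦ", "κυρίου", "μετὰ", "πάντων."],
    "rmac": ["t-gsm", "n-gsm", "prep", "a-gpm"],
})
codes = pd.DataFrame({
    "rmac": ["T-GSM", "N-GSM", "A-GPM"],
    "Case:": ["Genitive", "Genitive", "Genitive"],
})

# Normalise case on the lookup side so the keys match the lowercased word codes.
codes["rmac"] = codes["rmac"].str.lower()

# A left merge keeps every word, even when its code has no decoded row.
merged = words.merge(codes, on="rmac", how="left")
print(merged)
```

Words whose code is missing from the lookup table (here `prep`) come through with NaN in the decoded columns, which makes gaps in the scrape easy to spot.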


In [233]:
tester = {"hi":"bye"}
print(list(tester.keys())[0])
print(list(tester.values())[0])

hi
bye


In [225]:
print(new)

[[{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Number:': 'Plural'}, {'Gender:': 'Feminine'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Number:': 'Plural'}, {'Gender:': 'Feminine'}, {'Degree:': 'Comparative'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Degree:': 'Superlative'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Number:': 'Plural'}, {'Gender:': 'Masculine'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Number:': 'Plural'}, {'Gender:': 'Masculine'}, {'Degree:': 'Comparative'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Degree:': 'Superlative'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Number:': 'Plural'}, {'Gender:': 'Neuter'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Number:': 'Plural'}, {'Gender:': 'Neuter'}, {'Degree:': 'Comparative'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Degree:': 'Superlative'}], [{'Speech:': 'Adjective'}, {'Case:': 'Accusative'}, {'Number:': 'Sin

## Problem 3: Strong's Dictionary.

In [None]:
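# First attempt: read the XHTML directly with pandas.
strongs_dictionary = pd.read_xml("../Data/strongs-dictionary.xhtml")
strongs_dictionary.head()

# using BeautifulSoup4
with open('../Data/strongs-dictionary.xhtml', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, features='lxml')  # needs the lxml package installed

# Assumes each <li> is one dictionary term; store its text rather than the Tag object.
words = soup.find_all('li')
print(words[0].text)
# The column name 'entry' is a placeholder.
strongs_dictionary = pd.DataFrame({'entry': [word.text for word in words]})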

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
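
The `FeatureNotFound` error means BeautifulSoup could not find the lxml package in this environment. Installing lxml, or passing `'html.parser'` to BeautifulSoup, are the direct fixes. As a different technique, the sketch below pulls the `<li>` text using only the standard library's `xml.etree.ElementTree`. It assumes the XHTML is well-formed XML with no undefined entities, and the toy document is a hypothetical stand-in because the real file's layout is not shown here.

```python
import xml.etree.ElementTree as ET

# Toy XHTML (hypothetical structure and placeholder text).
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ol>
      <li id="t1">first term: plain text</li>
      <li id="t2">second term: <i>nested markup</i></li>
    </ol>
  </body>
</html>"""

root = ET.fromstring(doc)

# '{*}li' matches <li> in any namespace (Python 3.8+).
items = root.findall(".//{*}li")

# itertext() also collects text inside child tags such as <i>.
entries = ["".join(li.itertext()).strip() for li in items]
print(entries)
```

For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring(doc)`, and the resulting list can be passed to `pd.DataFrame` to get one term per row.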