# Initial Exploration of Biblical Texts

Name: Isaac Anderson

Date: Sept 3 2025

This problem set must accomplish the following tasks:
1. Read in 'SF_2009-01-20_GRC_TISCHENDORF_(TISCHENDORF GREEK NT(STRONGS)).xml' and save data to a dataframe (one word per row).
2. Parse the rmac codes.
3. Read in 'strongs-dictionary.xhtml' and save data to a dataframe (one term per row).
4. Compute the top 50 lemmas by frequency.

**Zipf's law**: in natural language, the frequency of a word is inversely proportional to its rank. For example, the second most frequent word occurs about half as often as the most frequent word, so the top few words cover a large fraction of the text.
If you have $n$ total tokens and sorted counts $f_1 \ge f_2 \ge \ldots \ge f_V$ for a vocabulary of $V$ distinct lemmas, the coverage of the first $k$ terms is $C_k = \frac{\sum_{i=1}^k f_i}{n}$.
The ideal Zipf prediction is $p(k) \propto 1/k$. For a finite vocabulary of size $V$, this normalizes to $p(k) = \frac{1/k}{\sum_{i=1}^V 1/i}$.
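
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's own solution: the token list is a made-up toy example, the function names are hypothetical, and `vocab_size` stands for the number of distinct lemmas (the finite vocabulary size in the Zipf formula).

```python
from collections import Counter


def coverage(counts, k):
    """Fraction of all tokens covered by the k most frequent items (C_k)."""
    ordered = sorted(counts.values(), reverse=True)
    n = sum(ordered)  # total number of tokens
    return sum(ordered[:k]) / n


def zipf_probability(k, vocab_size):
    """Ideal Zipf probability of rank k for a finite vocabulary."""
    harmonic = sum(1.0 / i for i in range(1, vocab_size + 1))
    return (1.0 / k) / harmonic


def zipf_coverage(k, vocab_size):
    """Ideal cumulative coverage of ranks 1..k."""
    return sum(zipf_probability(r, vocab_size) for r in range(1, k + 1))


# Toy token list (hypothetical, not taken from the Tischendorf text).
tokens = ["the"] * 6 + ["and"] * 3 + ["of"] * 2 + ["word"]
counts = Counter(tokens)

print(coverage(counts, 2))                         # 0.75
print(round(zipf_coverage(2, len(counts)), 2))     # 0.72
```

Plotting `coverage(counts, k)` and `zipf_coverage(k, len(counts))` against `k` on the same axes gives the observed and ideal curves side by side.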

6. Plot the coverage of the top 20 lemmas and list them in a table along with their Strong's definitions. On this same plot, plot the ideal Zipf prediction for a finite vocabulary size.
7. Identify a way to drop out the content-less words and then plot the coverage of these new top 20 lemmas and list them in a table along with their Strong's definitions. On this same plot, plot the ideal Zipf prediction for a finite vocabulary size.

8. Pickle your dataframes.

My workings.

In [184]:
# Dependencies
import pandas as pd
from bs4 import BeautifulSoup # requires lxml

In [257]:
# One row per <gr> element (one Greek word per row).
tisch_greek = pd.read_xml(
    "../Data/SF_2009-01-20_GRC_TISCHENDORF_(TISCHENDORF GREEK NT(STRONGS)).xml",
    parser="lxml",
    xpath="/XMLBIBLE//BIBLEBOOK//CHAPTER//VERS/gr",
)
tisch_greek.head()

Unnamed: 0,str,rmac,gr,STYLE
0,976.0,n-nsf,Βίβλος,
1,1078.0,n-gsf,γενέσεως,
2,2424.0,n-gsm,Ἰησοῦ,
3,5547.0,n-gsm,Χριστοῦ,
4,5207.0,n-gsm,υἱοῦ,
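
The table above has one word per row but no book, chapter, or verse reference. A minimal sketch of one way to keep that context, using only the standard library: the two-word sample is hypothetical, and the attribute names `bnumber`, `cnumber`, and `vnumber` are assumptions about the file's layout that should be checked against the real XML. The `str` and `rmac` attributes match the columns shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-word sample shaped like the file read above; bnumber,
# cnumber and vnumber are assumed attribute names, not confirmed by the notebook.
sample = """<XMLBIBLE>
  <BIBLEBOOK bnumber="40">
    <CHAPTER cnumber="1">
      <VERS vnumber="1">
        <gr str="976" rmac="N-NSF">Βίβλος</gr>
        <gr str="1078" rmac="N-GSF">γενέσεως</gr>
      </VERS>
    </CHAPTER>
  </BIBLEBOOK>
</XMLBIBLE>"""


def words_with_references(xml_text):
    """Return one dict per <gr> word, tagged with its book, chapter and verse."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.iter("BIBLEBOOK"):
        for chapter in book.iter("CHAPTER"):
            for verse in chapter.iter("VERS"):
                for word in verse.iter("gr"):
                    rows.append({
                        "book": book.get("bnumber"),
                        "chapter": chapter.get("cnumber"),
                        "verse": verse.get("vnumber"),
                        "str": word.get("str"),
                        "rmac": word.get("rmac"),
                        "gr": word.text,
                    })
    return rows


rows = words_with_references(sample)
print(len(rows))  # 2
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.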


## Problem 2: Parsing RMACS.

#### Filtering the RMACS.

In [193]:
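# Normalize RMAC codes to lowercase, then drop rows whose code ends with a hyphen.
# na=False keeps the mask boolean when a row has no rmac value.
tisch_greek['rmac'] = tisch_greek['rmac'].str.lower()
tisch_greek = tisch_greek[~tisch_greek['rmac'].str.endswith('-', na=False)]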

#### Scraping RMAC data.

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re



index_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC/RMACindex.html"
rmac_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC/"

# Get the metadata for all possible rmacs.
response = requests.get(index_url)
soup = BeautifulSoup(response.text, 'html5lib')

all_links = soup.find('blockquote').find_all('a')

contents = []
hrefs = []
for link in all_links:
    hrefs.append(link.get('href'))
    contents.append(link.text.strip())


#### RMAC processing function.

In [None]:
all_labels = []  # labels seen so far; used for making rmac database columns.

def decode_rmac(rmac_link: str) -> list:
    """Fetch one RMAC page and return its fields as a list of {label: meaning} dicts."""
    print(rmac_url + rmac_link)
    response = requests.get(rmac_url + rmac_link)
    rmac_soup = BeautifulSoup(response.text, 'html5lib')
    rows = rmac_soup.find_all('tr')
    first_entry = rows[2]  # third <tr> on the page

    # Keep only lines of the form "Label: value".
    meaningful_lines = []
    for line in first_entry.text.split("\n"):
        cleaned_item = line.strip()
        if re.search(r"\w+: ", cleaned_item):
            meaningful_lines.append(cleaned_item)

    print(meaningful_lines)

    rmac_decoded = []
    for line in meaningful_lines:
        # The label is the word immediately before the first colon, colon included
        # (e.g. "Part of Speech: Adjective." gives "Speech:").
        label = re.search(r"(\w+):", line).group()
        # The meaning is the first word after ": ".
        meaning = re.search(r"(?<=: )\w+", line).group()
        rmac_decoded.append({label: meaning})

        all_labels.append(label)  # This is for creating columns down the line.
    return rmac_decoded
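
The label and meaning extraction can be tried offline, without fetching any pages. The sketch below is a hypothetical variant rather than the function above: it keeps the full label ("Part of Speech" instead of "Speech:") and returns one dict per code. The input lines are copied from the printed output for `A-APF.htm`; the names `FIELD` and `parse_rmac_lines` are made up for illustration.

```python
import re

# "Label: value ..." where the value is the first word after the colon.
FIELD = re.compile(r"^(?P<label>[^:]+):\s*(?P<value>\w+)")


def parse_rmac_lines(lines):
    """Turn lines such as 'Case: Accusative (...)' into a {label: value} dict."""
    decoded = {}
    for line in lines:
        match = FIELD.match(line.strip())
        if match:
            decoded[match.group("label").strip()] = match.group("value")
    return decoded


lines = [
    'Part of Speech: Adjective.',
    'Case: Accusative (direct object; motion toward).',
    'Number: Plural.',
    'Gender: Feminine',
]
decoded = parse_rmac_lines(lines)
print(decoded)
# {'Part of Speech': 'Adjective', 'Case': 'Accusative', 'Number': 'Plural', 'Gender': 'Feminine'}
```

Returning a single dict per code also means a list of such dicts can be handed to `pd.DataFrame` directly, with missing fields left as NaN.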

In [None]:
rmacs_df = pd.DataFrame()
rmacs_df['rmac'] = contents

# Decode every RMAC page and spread its fields across columns.
for index, row in rmacs_df.iterrows():
    decoded_rmac_parts = decode_rmac(row['rmac'] + ".htm")

    for dict_segment in decoded_rmac_parts:
        for label, meaning in dict_segment.items():
            rmacs_df.at[index, label] = meaning

rmacs_df.to_csv("./all_rmacs_decoded")



https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APF.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Feminine']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APF-C.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Feminine.', 'Degree: Comparative.']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APF-S.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward)', 'Degree: Superlative.']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APM.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Masculine']
https://www.modernliteralversion.org/bibles/bs2/RMAC/A-APM-C.htm
['Part of Speech: Adjective.', 'Case: Accusative (direct object; motion toward).', 'Number: Plural.', 'Gender: Masculine.', 'Degree: Comparative.']
https://www.modernliteralversion.

In [262]:
tisch_rmacs_df = pd.DataFrame()
# Upper-case the token codes so they match the scraped RMAC index.
tisch_rmacs_df['rmac'] = [item.upper() for item in tisch_greek['rmac']]


In [267]:
tisch_parsed_rmacs_df = pd.merge(rmacs_df, tisch_rmacs_df)
tisch_greek_parsed = pd.merge(tisch_parsed_rmacs_df, tisch_greek)
tisch_greek_parsed

Unnamed: 0,rmac,Speech:,Case:,Number:,Gender:,Degree:,Form:,Tense:,Voice:,Mood:,Person:,str,gr,STYLE
0,A-APF-S,Adjective,Accusative,,,Superlative,,,,,,2078.0,"ἐσχάτας,",
1,A-APM-S,Adjective,Accusative,,,Superlative,,,,,,4413.0,πρώτους,
2,A-APM-S,Adjective,Accusative,,,Superlative,,,,,,4413.0,πρώτους,
3,A-APM-S,Adjective,Accusative,,,Superlative,,,,,,4413.0,πρώτους·,
4,A-APM-S,Adjective,Accusative,,,Superlative,,,,,,2078.0,ἐσχάτους,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165678,V-RNP-ASF,Verb,Accusative,Singular,Feminine,,,peRfect,middle,Participle,,4279.0,προεπηγγελμένην,
165679,V-RPP-GPF,Verb,Genitive,Plural,Feminine,,,peRfect,Passive,Participle,,2808.0,κεκλεισμένων,
165680,V-RPP-GPF,Verb,Genitive,Plural,Feminine,,,peRfect,Passive,Participle,,2808.0,"κεκλεισμένων,",
165681,V-RPP-GPF,Verb,Genitive,Plural,Feminine,,,peRfect,Passive,Participle,,2808.0,κεκλεισμένων,


In [261]:
len(tisch_greek)

137502
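
The merged frame above runs to index 165681, which is more rows than the 137502 in `tisch_greek`. A merge returns more rows than the token table only when a key matches more than one row on the other side, so a duplicated `rmac` somewhere is a likely cause. The following is a minimal sketch of a guard using toy frames (not the notebook's data): it assumes the lookup table should hold one row per code and uses `validate="many_to_one"` so pandas raises an error if it does not.

```python
import pandas as pd

# Toy token table: one row per word.
tokens = pd.DataFrame({
    "rmac": ["N-NSF", "N-GSF", "N-GSM", "N-GSM"],
    "gr": ["Βίβλος", "γενέσεως", "Ἰησοῦ", "Χριστοῦ"],
})

# Toy lookup table with one code duplicated on purpose.
lookup = pd.DataFrame({
    "rmac": ["N-NSF", "N-GSF", "N-GSM", "N-GSM"],
    "Case": ["Nominative", "Genitive", "Genitive", "Genitive"],
})

# Without a check, each N-GSM token matches two lookup rows.
unchecked = tokens.merge(lookup, on="rmac", how="left")
print(len(tokens), len(unchecked))  # 4 6

# De-duplicate the lookup, then let pandas enforce one lookup row per code.
unique_lookup = lookup.drop_duplicates(subset="rmac")
checked = tokens.merge(unique_lookup, on="rmac", how="left", validate="many_to_one")
print(len(checked))  # 4
```

A left merge from the token table keeps exactly one output row per word, which makes the row count easy to compare with `len(tisch_greek)`.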

## Problem 3: Strong's Dictionary.

In [272]:
strongs_dictionary = pd.read_xml("../Data/strongs-dictionary.xhtml")
strongs_dictionary.head()

# using BeautifulSoup4
with open('../Data/strongs-dictionary.xhtml', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, features='lxml')

words = soup.find_all('li')
print(words)



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

  soup = BeautifulSoup(file, features='lxml')

#### Reading in Strong's Dictionary as a file.

In [None]:
with open("../Data/strongs-dictionary.xhtml", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, 'lxml')




<ol>
<li id="ot:1" value="1"><i title="{awb}" xml:lang="hbo">אָב</i> a primitive word; father, in a literal and immediate, or figurative and remote application): <span class="kjv_def">chief, (fore-)father(-less), X patrimony, principal</span>. Compare names in "Abi-".</li>
<li id="ot:2" value="2"><i title="{ab}" xml:lang="oar">אַב</i> (Aramaic) corresponding to <a href="#ot:1"><i title="{awb}" xml:lang="hbo">אָב</i></a>: <span class="kjv_def">father</span>.</li>
<li id="ot:3" value="3"><i title="{abe}" xml:lang="hbo">אֵב</i> from the same as <a href="#ot:24"><i title="{aw-beeb'}" xml:lang="hbo">אָבִיב</i></a>; a green plant: <span class="kjv_def">greenness, fruit</span>.</li>
<li id="ot:4" value="4"><i title="{abe}" xml:lang="oar">אּנְבָּא</i> (Aramaic) corresponding to <a href="#ot:3"><i title="{abe}" xml:lang="hbo">אֵב</i></a>: <span class="kjv_def">fruit</span>.</li>
<li id="ot:5" value="5"><i title="{ab-ag-thaw'}" xml:lang="hbo">אֲבַגְתָא</i> of foreign origin; Abagtha, a eunuch of Xerxe

#### Processing the raw Strong's dictionary.

In [299]:
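# soup.find('ol') returns only the first <ol> in the file; the entries printed
# above have ids beginning with "ot:".
table = soup.find('ol').find_all('li')

# Flatten each <li> entry to its plain text.
clean_table = [row.text for row in table]

strongs_dictionary = pd.DataFrame(data=clean_table, columns=['word'])
strongs_dictionary.head()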

Unnamed: 0,word
0,"אָב a primitive word; father, in a literal and ..."
1,אַב (Aramaic) corresponding to אָב: father.
2,אֵב from the same as אָבִיב; a green plant: gre...
3,אּנְבָּא (Aramaic) corresponding to אֵב: fruit.
4,"אֲבַגְתָא of foreign origin; Abagtha, a eunuch..."
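
Flattening each entry to text discards structure that the raw markup already provides: the entry id, the headword in the first `<i>`, its transliteration in the `title` attribute, and the gloss in `<span class="kjv_def">`. The sketch below shows one way to keep those fields, using the standard library on two entries copied from the output above. It is an illustration under assumptions: the function name and output keys are hypothetical, and the sample omits any XHTML namespace the full file may declare, which would change the tag names `ElementTree` sees.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Two entries copied from the printed dictionary output.
sample = """<ol>
<li id="ot:2" value="2"><i title="{ab}" xml:lang="oar">אַב</i> (Aramaic) corresponding to <a href="#ot:1"><i title="{awb}" xml:lang="hbo">אָב</i></a>: <span class="kjv_def">father</span>.</li>
<li id="ot:3" value="3"><i title="{abe}" xml:lang="hbo">אֵב</i> from the same as <a href="#ot:24"><i title="{aw-beeb'}" xml:lang="hbo">אָבִיב</i></a>; a green plant: <span class="kjv_def">greenness, fruit</span>.</li>
</ol>"""


def parse_entry(li):
    """Split one <li> entry into id parts, headword, transliteration and gloss."""
    headword = li.find("i")                      # first direct <i> child
    gloss = li.find("span[@class='kjv_def']")    # KJV gloss, if present
    source, _, number = li.get("id").partition(":")
    return {
        "source": source,                        # id prefix, e.g. "ot"
        "number": int(number),
        "word": headword.text,
        "transliteration": headword.get("title").strip("{}"),
        "language": headword.get(XML_LANG),
        "kjv_def": gloss.text if gloss is not None else None,
        "text": "".join(li.itertext()),          # full flattened entry
    }


entries = [parse_entry(li) for li in ET.fromstring(sample).iter("li")]
first = entries[0]
print(first["source"], first["number"], first["transliteration"], first["kjv_def"])
# ot 2 ab father
```

Keeping the id prefix and number as columns makes it possible to select entries by numbering scheme and to join on the Strong's number rather than on position in the list.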


#### Splitting the Strong's dataframe into word and definition

In [300]:
def get_greek_word(entry: str) -> str:
    # Headword: everything before the first space (the whole entry if there is no space).
    return entry.partition(" ")[0]

def remove_greek_word(entry: str) -> str:
    # Definition: everything after the first space (empty if there is no space).
    return entry.partition(" ")[2]

strongs_dictionary['greek_word'] = strongs_dictionary['word'].apply(get_greek_word)
strongs_dictionary['definition'] = strongs_dictionary['word'].apply(remove_greek_word)
strongs_dictionary = strongs_dictionary.drop("word", axis=1)
strongs_dictionary.head()


Unnamed: 0,greek_word,definition
0,אָב,"a primitive word; father, in a literal and im..."
1,אַב,(Aramaic) corresponding to אָב: father.
2,אֵב,from the same as אָבִיב; a green plant: greenn...
3,אּנְבָּא,(Aramaic) corresponding to אֵב: fruit.
4,אֲבַגְתָא,"of foreign origin; Abagtha, a eunuch of Xerxe..."


## Problem 4: Top 50 Lemmas by Frequency.