# Initial Exploration of Biblical Texts

Name: Isaac Anderson

Date: Sept 3 2025

This problem set must accomplish the following tasks:
1. Read in 'SF_2009-01-20_GRC_TISCHENDORF_(TISCHENDORF GREEK NT(STRONGS)).xml' and save data to a dataframe (one word per row).
2. Parse the rmac codes.
3. Read in 'strongs-dictionary.xhtml' and save data to a dataframe (one term per row).
4. Compute the top 50 lemmas by frequency.

**Zipf's law**: in natural language, the frequency of a word is inversely proportional to its rank. For example, the second most frequent word occurs about half as often as the most frequent word, so the top few words cover a large fraction of the text.
If you have $n$ total tokens and counts sorted in descending order, $f_1 \ge f_2 \ge \ldots \ge f_V$ (where $V$ is the vocabulary size), the coverage of the first $k$ terms is $C_k = \frac{\sum_{i=1}^k f_i}{n}$.
The ideal Zipf prediction is $p(k) \propto 1/k$. For a finite vocabulary of size $V$, the normalized prediction is $p(k) = \frac{1/k}{\sum_{i=1}^V 1/i}$.
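
The two quantities defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's own solution: the token list is made-up toy data, and the helper names `coverage` and `zipf_coverage` are hypothetical.

```python
from collections import Counter
from itertools import accumulate

def coverage(counts):
    """Cumulative coverage C_k for counts sorted in descending order."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

def zipf_coverage(vocab_size, k):
    """Cumulative share of the top k ranks under an ideal finite Zipf law."""
    harmonic = sum(1 / i for i in range(1, vocab_size + 1))
    probs = [(1 / rank) / harmonic for rank in range(1, k + 1)]
    return list(accumulate(probs))

# Toy tokens (hypothetical data, not taken from the Tischendorf text).
tokens = ["ho", "kai", "ho", "autos", "ho", "kai", "de", "ho"]
counts = sorted(Counter(tokens).values(), reverse=True)  # [4, 2, 1, 1]

observed = coverage(counts)
ideal = zipf_coverage(vocab_size=len(counts), k=len(counts))
print(observed)  # [0.5, 0.75, 0.875, 1.0]
print(ideal)
```

Plotting `observed` against `ideal` for the first 20 ranks would give the comparison that the later tasks ask for.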

6. Plot the coverage of the top 20 lemmas and list them in a table along with their Strong's definitions. On this same plot, plot the ideal Zipf prediction for a finite vocabulary size.
7. Identify a way to drop out the content-less words and then plot the coverage of these new top 20 lemmas and list them in a table along with their Strong's definitions. On this same plot, plot the ideal Zipf prediction for a finite vocabulary size.

8. Pickle your dataframes.
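
Task 7 leaves open how to drop content-less words. One possible approach, sketched below under stated assumptions, is to filter tokens by their decoded part of speech before counting lemmas. The `(lemma, part_of_speech)` pairs and the `FUNCTION_POS` set are hypothetical; which categories count as content-less is a judgment call.

```python
from collections import Counter

# Hypothetical (lemma, part_of_speech) pairs; in the notebook these would come
# from the lemma and part-of-speech columns of the token dataframe.
tokens = [
    ("ho", "definite article"), ("logos", "noun"), ("kai", "conjunction"),
    ("ho", "definite article"), ("theos", "noun"), ("en", "preposition"),
    ("ho", "definite article"), ("logos", "noun"),
]

# Assumed set of content-less categories; extend or shrink as needed.
FUNCTION_POS = {
    "definite article", "conjunction", "preposition",
    "particle", "personal pronoun", "relative pronoun",
}

# Count only the lemmas whose part of speech is not a function category.
content_counts = Counter(
    lemma for lemma, pos in tokens if pos not in FUNCTION_POS
)
print(content_counts.most_common(2))  # [('logos', 2), ('theos', 1)]
```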

My WORKINGS.

In [184]:
# Dependencies
import pandas as pd
from bs4 import BeautifulSoup # requires lxml

In [185]:
tisch_greek = pd.read_xml(
                        "../Data/SF_2009-01-20_GRC_TISCHENDORF_(TISCHENDORF GREEK NT(STRONGS)).xml",
                        parser = "lxml",
                        xpath = "/XMLBIBLE//BIBLEBOOK//CHAPTER//VERS/gr")

tisch_greek.tail()


Unnamed: 0,str,rmac,gr,STYLE
137497,3588.0,t-gsm,τοῦ,
137498,2962.0,n-gsm,κυρίου,
137499,2424.0,n-gsm,Ἰησοῦ,
137500,3326.0,prep,μετὰ,
137501,3956.0,a-gpm,πάντων.,


In [186]:
# https://chatgpt.com/share/68b884fd-1004-800c-b945-313f10b5bc43

print(tisch_greek['rmac'].value_counts())

# holder = (~tisch_greek['rmac'].str.contains("-"))
one_word_map = tisch_greek['rmac'].astype(str).str.contains("-")



rmac
conj          16271
prep          10390
adv            3879
n-nsm          3463
n-gsm          2937
              ...  
v-2adm-2s         1
v-adm-3p          1
v-2rap-nsf        1
v-rdi-3s          1
v-ras-2s          1
Name: count, Length: 1034, dtype: int64


In [187]:
# references for parse_rmac()

# verbs
tenses_dict = {"p":"present", "i":"imperfect", "f":"future", "a":"aorist", "r":"perfect",
               "l":"pluperfect", "2f":"2nd future", "2a":"2nd aorist", "2r":"2nd perfect",
               "2l":"2nd pluperfect"}
voices_dict = {"a":"active", "m":"middle", "p":"passive", "e":"middle/passive", 
               "d":"middle deponent", "o":"passive deponent", "n":"middle/passive deponent"}
moods_dict = {"i":"indicative","s":"subjunctive","o":"optative","m":"imperative","n":"infinitive",
              "p":"participle"}
gender_dict = {"m":"masculine", "f":"feminine", "n":"neuter"}
cases_dict = {"n":"nominative","g":"genitive","d":"dative","a":"accusative"}
number_dict = {"s":"singular","p":"plural"}

# undeclinable
undeclined_forms_dict = {
    "adv":   "adverb",
    "conj":  "conjunction",
    "cond":  "conditional",
    "prt":   "particle",
    "prep":  "preposition",
    "inj":   "interjection",
    "aram":  "aramaic",
    "heb":   "hebrew",
    "n-pri": "indeclinable proper noun",
    "a-nui": "indeclinable numeral (adjective)",
    "n-li":  "indeclinable letter (noun)",
    "n-oi":  "indeclinable noun of other type"
}
# noun likes
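noun_likes_dict = {
    "p":"personal pronoun",       # prepositions use the undeclined code "prep"
    "t":"definite article",
    "r":"relative pronoun",
    "a":"adjective",
    "d":"demonstrative pronoun",
    "i":"interrogative pronoun"   # matches prefixs_dict; indefinite pronouns use "x"
}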

# declined forms
prefixs_dict = {
    "n":"noun",
    "a":"adjective",
    "r":"relative pronoun",
    "c":"reciprocal pronoun",
    "d":"demonstrative pronoun",
    "t":"definite article",
    "k":"correlative pronoun",
    "i":"interrogative pronoun",
    "x":"indefinite pronoun",
    "q":"correlative or interrogative pronoun",
    "f":"reflexive pronoun",
    "s":"possessive adjective",
    "p":"personal pronoun",
}

noun_cases_dict = {
    "n":"nominative",
    "v":"vocative",
    "g":"genitive",
    "d":"dative",
    "a":"accusative",
}

In [188]:
tester = ["hi","bye","die"]
tester = tester.pop(0)
print(tester)

hi


In [189]:
def parse_verb(verb: str) -> list:
    # drop the leading "v"; what remains is tense/voice/mood plus optional extras
    verb_sections = verb.split("-")[1:]
    if not 1 <= len(verb_sections) <= 3:
        raise ValueError("Invalid amount of sections. Must be 2, 3 or 4 (including the leading 'v').")
    ret_verb = ["verb"]

    # must be done for every verb; "2"-prefixed tenses take two characters
    tvm = verb_sections[0]
    tense_len = 2 if tvm[0] == "2" else 1
    ret_verb.append(tenses_dict[tvm[:tense_len]])
    ret_verb.append(voices_dict[tvm[tense_len]])
    ret_verb.append(moods_dict[tvm[tense_len + 1]])

    for section in verb_sections[1:]:
        if section == "att":
            ret_verb.append("Attic Greek form")
        elif section[0].isdigit():
            # finite verbs: person and number, e.g. "3s"
            ret_verb.append(section[0] + " person")
            ret_verb.append(number_dict[section[1]])
        else:
            # participles: case, number and gender, e.g. "nsm"
            # (noun_cases_dict is used because it also covers the vocative)
            ret_verb.append(noun_cases_dict[section[0]])
            ret_verb.append(number_dict[section[1]])
            ret_verb.append(gender_dict[section[2]])
    return ret_verb

In [190]:
parse_verb('v-ran-att')

['verb', 'perfect', 'active', 'infinitive', 'Attic Greek form']

In [191]:
def parse_noun(noun: str) -> list: 
    noun_parts = noun
    ret_noun = ["noun"]

    ret_noun.append(noun_cases_dict[noun_parts[0]])
    ret_noun.append(number_dict[noun_parts[1]])
    ret_noun.append(gender_dict[noun_parts[2]])

    return ret_noun

def parse_noun_like(noun: str) -> list: 
    ret_noun_like = [noun_likes_dict[noun[0]]]
    noun_parts = noun[2:]
    # personal pronouns may carry a person digit before case and number, e.g. "p-1gs"
    if noun[0] == "p" and noun_parts[0].isnumeric():      
        ret_noun_like.append(noun_parts[0] + " person")
        ret_noun_like.append(noun_cases_dict[noun_parts[1]])
        ret_noun_like.append(number_dict[noun_parts[2]])
        return ret_noun_like
    else:
        ret_noun_like.append(noun_cases_dict[noun_parts[0]])
        ret_noun_like.append(number_dict[noun_parts[1]])
        ret_noun_like.append(gender_dict[noun_parts[2]])
        return ret_noun_like

def parse_particles(particle: str) -> list:
    ret_particle = ["particle"]

    part_spec_dict = {
        "n":"negative",
        "c":"connective",
        "d":"explanatory",
        "i":"indefinite/conditional",
        "e":"emphatic",
        "q":"interrogative"
    }
    ret_particle.append(part_spec_dict[particle])
    return ret_particle

def parse_adverbs(adverb: str) -> list:
    ret_adv = ['adverb']

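    adv_dict = {
        "n":"negative",
        "m":"manner",
        "t":"time",
        "p":"place",
        "d":"degree",
        "i":"interrogative",
        "c":"comparative",
        "s":"superlative"
    }

    # look the suffix up; fall back to the raw letter for suffixes not listed above
    ret_adv.append(adv_dict.get(adverb[0], adverb[0]))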
    return ret_adv



In [192]:
def parse_rmac(rmac_code) -> list:
    # debug trace of each code as it is parsed
    print(rmac_code)
    # every branch returns a list so that callers can take the part of speech with [0]
    if rmac_code in undeclined_forms_dict.keys():
        return [undeclined_forms_dict[rmac_code]]
    elif rmac_code[0] == "v":
        return parse_verb(rmac_code)
    elif rmac_code[0] == "n":
        rmac_code = rmac_code[2:]
        return parse_noun(rmac_code)
    elif rmac_code[0:3] == "prt":
        return parse_particles(rmac_code[4:])
    elif rmac_code[0:3] == "adv":
        return parse_adverbs(rmac_code[-1])
    elif rmac_code[0] in noun_likes_dict.keys():
        return parse_noun_like(rmac_code)
    else: 
        return ["none"]
    # elif rmac_code == "2gs":
    #     return []
    # else:
    #     raise Exception(f" {rmac_code} failed! Not recognised: Not a verb/noun/noun-like/indeclinable \n noun-likes:{noun_likes_dict.keys()} \n indeclinables:{undeclined_forms_dict.keys()}")


In [193]:
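# normalise rmac codes to lower case and drop malformed codes that end in a hyphen
tisch_greek['rmac'] = tisch_greek['rmac'].str.lower()
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
tisch_greek = tisch_greek[~tisch_greek['rmac'].str.endswith('-')].copy()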

In [194]:
def get_speech(rmac: str):
    return parse_rmac(rmac)[0]

tisch_greek["part_speech"] = tisch_greek['rmac'].apply(get_speech)
tisch_greek.head()

n-nsf
n-gsf
n-gsm
n-gsm
n-gsm
n-pri
n-gsm
n-pri
n-pri
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-asm
conj
t-apm
n-apm
p-gsm
n-nsm
conj
v-aai-3s
t-asm
n-pri
conj
t-asm
n-pri
prep
t-gsf
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
prep
t-gsf
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
prep
t-gsf
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
t-asm
n-asm
n-pri
conj
v-aai-3s
t-asm
n-asm
prep
t-gsf
t-gsm
n-gsm
n-nsm
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-asm
n-nsm
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-asm
n-nsm
conj
v-aai-3s
t-asm
n-asm
n-nsm
conj
v-aai-3s
t-asm
n-pri
n-pri
conj
v-aai-3s
t-asm
n-asm

KeyError: 'u'
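
The traceback above names the missing key but not the code that triggered it. One way to locate offending codes, sketched here as an assumption rather than the notebook's method, is to run the parser once per unique code and collect the exceptions. `find_failures` and `toy_parse` are hypothetical names; `toy_parse` is a small stand-in for `parse_rmac`, and the sample code `a-nui-abb` is only illustrative (it is not established that this code caused the error above).

```python
def find_failures(codes, parser):
    """Return {code: error message} for every code the parser cannot handle."""
    failures = {}
    for code in sorted(set(codes)):
        try:
            parser(code)
        except (KeyError, IndexError, ValueError) as err:
            failures[code] = f"{type(err).__name__}: {err}"
    return failures

# Hypothetical stand-in parser: decodes case/number/gender after the prefix.
CASES = {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}
NUMBERS = {"s": "singular", "p": "plural"}
GENDERS = {"m": "masculine", "f": "feminine", "n": "neuter"}

def toy_parse(code):
    case, number, gender = code.split("-")[1][:3]
    return [CASES[case], NUMBERS[number], GENDERS[gender]]

failures = find_failures(["n-gsm", "a-nui-abb", "n-gsm", "t-asm"], toy_parse)
print(failures)  # {'a-nui-abb': "KeyError: 'u'"}
```

In the notebook, the same helper could be called with `tisch_greek['rmac'].unique()` and `parse_rmac`, which checks each distinct code once instead of every row.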

In [None]:
strongs_dictionary = pd.read_xml("../Data/strongs-dictionary.xhtml")
strongs_dictionary.head()

# using BeautifulSoup4
with open('../Data/strongs-dictionary.xhtml', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file, features='lxml')

words = soup.find_all('li')
print(words[0].text)
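# one row per <li>, storing its text; the column name "entry" is a placeholder
strongs_dictionary = pd.DataFrame({"entry": [word.get_text(" ", strip=True) for word in words]})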

FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
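
If no third-party parser is available, well-formed XHTML can also be read with the standard library. The sketch below uses `xml.etree.ElementTree` on a made-up fragment; the real layout of `strongs-dictionary.xhtml` (element names, `id` attributes) is an assumption here, and a file that uses HTML entities such as `&nbsp;` may not parse this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real structure of the dictionary file may differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ol>
      <li id="g1"><i>alpha</i> first letter</li>
      <li id="g2"><i>beta</i> second
          letter</li>
    </ol>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

rows = []
# "{*}li" matches <li> in any namespace (supported from Python 3.8).
for li in root.findall(".//{*}li"):
    # Join the text of all child nodes and collapse runs of whitespace.
    text = " ".join("".join(li.itertext()).split())
    rows.append({"id": li.get("id"), "text": text})

print(rows)
```

For the real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`, and `pd.DataFrame(rows)` would give one term per row.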

In [195]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re



index_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC/RMACindex.html"
rmac_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC"
response = requests.get(index_url)

soup = BeautifulSoup(response.text, 'html5lib')

all_links = soup.find('blockquote').find_all('a')

contents = []
hrefs = []
for link in all_links:
    hrefs.append(link.get('href'))
    contents.append(link.text.strip())


In [196]:
rmac_url = "https://www.modernliteralversion.org/bibles/bs2/RMAC/"
all_possible_options = []
def decode_page_noun(rmac_link: str):
    """Fetch one RMAC page and return the decoded values from its first entry."""
    print(rmac_url+rmac_link)
    response = requests.get(rmac_url+rmac_link)
    rmac_soup = BeautifulSoup(response.text, 'html5lib')
    rows  = rmac_soup.find_all('tr')
    first_entry = rows[2]

    # keep only the "Label: value" lines of the entry
    lines = first_entry.text.split("\n")
    meaningful_lines = []
    for line in lines:
        cleaned_item = line.strip()
        if re.search(r"\w+\: ",cleaned_item):
            meaningful_lines.append(cleaned_item)
            
    print(meaningful_lines)

    rmac_decoded = []
    for line in meaningful_lines:
        # value: the first word after ": "
        rmac_decoded.append(re.search(r"(?<=: )\w+", line).group())
        # label (including its colon), recorded globally to discover the set of labels
        all_possible_options.append(re.search(r"(\w+)\:", line).group())
    return rmac_decoded

In [198]:
rmac_options = set(all_possible_options)
print(rmac_options)

{'Speech:', 'Tense:', 'Gender:', 'Degree:', 'Form:', 'Case:', 'Number:', 'Voice:', 'Person:', 'Mood:'}


In [201]:

rmacs_df = pd.DataFrame(columns = list(rmac_options))
rmacs_df['rmac'] = contents
rmacs_df.head()

Unnamed: 0,Speech:,Tense:,Gender:,Degree:,Form:,Case:,Number:,Voice:,Person:,Mood:,rmac
0,,,,,,,,,,,A-APF
1,,,,,,,,,,,A-APF-C
2,,,,,,,,,,,A-APF-S
3,,,,,,,,,,,A-APM
4,,,,,,,,,,,A-APM-C
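
The empty columns above still need to be filled from the decoded pages. A minimal sketch of one way to do that is to turn each page's "Label: value" lines into a flat record keyed by label. The helper name `lines_to_row` and the sample lines are hypothetical; the actual page text may be worded differently.

```python
import re

# Label is the word before the colon; value is the rest of the line.
LABEL_VALUE = re.compile(r"(\w+):\s*(.+)")

def lines_to_row(code, lines):
    """Turn 'Label: value' lines into one flat record keyed by label."""
    row = {"rmac": code}
    for line in lines:
        match = LABEL_VALUE.match(line.strip())
        if match:
            label, value = match.groups()
            row[label] = value.strip()
    return row

# Hypothetical decoded lines for one code.
row = lines_to_row(
    "A-APF",
    ["Speech: Adjective", "Case: Accusative", "Number: Plural", "Gender: Feminine"],
)
print(row)
```

A list of such records can be passed to `pd.DataFrame`, which leaves labels that are missing for a given code as NaN.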


In [None]:
for rmac in rmacs_df['rmac']:
    print(rmac)

A-APF
A-APF-C
A-APF-S
A-APM
A-APM-C
A-APM-S
A-APN
A-APN-C
A-APN-S
A-ASF
A-ASF-C
A-ASF-N
A-ASF-S
A-ASM
A-ASM-C
A-ASM-N
A-ASM-S
A-ASN
A-ASN-C
A-ASN-N
A-ASN-S
A-DPF
A-DPF-C
A-DPF-S
A-DPM
A-DPM-C
A-DPM-S
A-DPN
A-DPN-S
A-DSF
A-DSF-C
A-DSF-S
A-DSM
A-DSM-C
A-DSM-N
A-DSM-S
A-DSN
A-DSN-C
A-DSN-N
A-DSN-S
A-GMS
A-GPF
A-GPF-S
A-GPM
A-GPM-C
A-GPM-S
A-GPN
A-GPN-C
A-GPN-S
A-GSF
A-GSF-C
A-GSF-S
A-GSM
A-GSM-C
A-GSM-N
A-GSM-S
A-GSN
A-GSN-N
A-GSN-S
A-NPF
A-NPF-C
A-NPF-S
A-NPM
A-NPM-C
A-NPM-S
A-NPN
A-NPN-C
A-NPN-S
A-NSF
A-NSF-C
A-NSF-N
A-NSF-S
A-NSM
A-NSM-ATT
A-NSM-C
A-NSM-N
A-NSM-S
A-NSN
A-NSN-C
A-NSN-N
A-NSN-S
A-NUI
A-NUI-ABB
A-VPM
A-VSF
A-VSM
A-VSM-S
A-VSN
ADV
ADV-C
ADV-I
ADV-K
ADV-N
ADV-S
ARAM
C-APM
C-DPM
C-DPN
C-GPM
C-GPN
COND
COND-K
CONJ
CONJ-N
D-APF
D-APM
D-APM-K
D-APN
D-APN-C
D-APN-K
D-ASF
D-ASM
D-ASM-K
D-ASN
D-DPF
D-DPM
D-DPM-C
D-DPM-K
D-DPN
D-DSF
D-DSM
D-DSN
D-GPF
D-GPM
D-GPN
D-GSF
D-GSM
D-GSN
D-NPF
D-NPM
D-NPM-K
D-NPN
D-NPN-K
D-NSF
D-NSM
D-NSM-K
D-NSN
F-1APM
F-1ASM
F-1DPM
F-1DSM
F-1GPM
F-1GSM
F