# Statistical Tests

Name: **Isaac Anderson**

Date: **16 Sept 2025**

## Hypothesis Tests

### Means
1. Identify 5 theologically salient lemmas from the top 20 lemmas (end of PS1) and compute per-verse rates by book.
2. Are the per-verse rates for the Synoptic Gospels and John the same for all 5 lemmas?
3. Report effect sizes using Cohen's d along with confidence intervals.
4. Correct these results for multiple tests using the Bonferroni correction.
5. Correct these results for multiple tests using the Benjamini-Hochberg correction.
6. Why are the numbers from the Bonferroni correction different from the Benjamini-Hochberg correction? Which results should you use and why?
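
As a rough illustration of item 3, the sketch below computes Cohen's d with a pooled standard deviation and a large-sample, normal-approximation confidence interval. The function name `cohens_d_with_ci` and the per-verse counts are hypothetical, and the interval uses a common approximation for the standard error of d rather than an exact method.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(x, y, confidence=0.95):
    """Cohen's d (pooled SD) with an approximate normal-theory CI."""
    n1, n2 = len(x), len(y)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    d = (mean(x) - mean(y)) / math.sqrt(pooled_var)
    # Common large-sample approximation to the standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d, (d - z * se, d + z * se)


# Hypothetical per-verse counts of one lemma (toy data, not from the corpus).
synoptic_counts = [0, 1, 0, 0, 2, 0, 1, 0, 0, 1]
john_counts = [1, 2, 0, 1, 3, 1, 0, 2, 1, 1]

d, (low, high) = cohens_d_with_ci(synoptic_counts, john_counts)
print(f"d = {d:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

A negative d here means the second group (John, in this toy data) has the higher mean rate. With real verse-level counts, which are heavily skewed toward zero, the normal approximation should be treated with some caution.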

### Proportion
7. Is the proportion of verses in which the term is present (vs. absent) the same between the Synoptic Gospels and John for all 5 lemmas?
8. Report effect sizes using Cohen's d along with confidence intervals.
9. Correct these results for multiple tests using the Bonferroni correction.
10. Correct these results for multiple tests using the Benjamini-Hochberg correction.
11. Why are the numbers from the Bonferroni correction different from the Benjamini-Hochberg correction? Which results should you use and why?
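
To make the contrast between the two corrections concrete, here is a minimal standard-library sketch of both adjustments. The raw p-values are hypothetical; `statsmodels.stats.multitest.multipletests` provides the same adjustments and is likely the better choice in the actual notebook.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for the five lemma tests (illustrative only).
raw = [0.001, 0.012, 0.020, 0.035, 0.300]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)

for p, b, h in zip(raw, bonf, bh):
    print(f"raw={p:.3f}  bonferroni={b:.3f}  bh={h:.3f}")
```

Bonferroni controls the family-wise error rate and scales every p-value by the full number of tests, while Benjamini-Hochberg controls the false discovery rate and scales each p-value by the number of tests divided by its rank, so its adjusted values are never larger.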

### ANOVA
12. Do a term rate ANOVA comparison on a theologically salient lemma from the top 20 lemmas across the following groups: {Matthew and Mark, Luke and Acts, Johannine books, Pauline books, all other books}
13. Perform a Post-hoc Tukey HSD to see which pairs differ
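
The F statistic behind item 12 can be sketched by hand as follows. The group values are hypothetical per-verse counts, and `one_way_anova_f` is an illustrative helper. In practice `scipy.stats.f_oneway` returns both the statistic and its p-value, which this sketch does not compute.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical per-verse counts of one lemma in each book group (toy data).
groups = {
    "Matthew and Mark": [0, 1, 0, 2, 1],
    "Luke and Acts": [1, 0, 0, 1, 0],
    "Johannine books": [2, 3, 1, 2, 2],
    "Pauline books": [0, 0, 1, 0, 1],
    "All other books": [1, 1, 0, 0, 1],
}

f_stat, df_between, df_within = one_way_anova_f(list(groups.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

A large F only says that at least one group mean differs; the Tukey HSD step in item 13 is what identifies which pairs differ.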

### Chi-square Test
14. Build a 2xk contingency table for collocation: presence of target term vs top-k companion terms within verse.
15. Run Chi-square tests of independence; compute standardized residuals to find surprising co-occurrences
16. Visualize a heatmap of residuals
17. Explain results
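
A minimal sketch of items 14 and 15, assuming a 2xk table whose rows are "target present" and "target absent" and whose columns are companion terms. The counts and the companion list are hypothetical. The residuals computed here are adjusted standardized residuals; `scipy.stats.chi2_contingency` can supply the statistic, p-value, and expected counts in the real analysis.

```python
import math


def chi_square_with_residuals(table):
    """Return (chi2, dof, adjusted standardized residuals) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            scale = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            res_row.append((observed - expected) / scale)
        residuals.append(res_row)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, residuals


# Hypothetical counts: verses containing each companion term,
# split by whether the target term is also present (toy data).
companions = ["θεός", "πατήρ", "κύριος", "λέγω"]
table = [
    [30, 12, 18, 40],      # target present
    [170, 188, 182, 160],  # target absent
]

chi2, dof, residuals = chi_square_with_residuals(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
for term, r in zip(companions, residuals[0]):
    print(f"{term}: residual (target present) = {r:+.2f}")
```

Residuals well beyond roughly ±2 mark cells that occur more or less often than independence would predict, and the nested `residuals` list is the matrix that a heatmap in item 16 would display.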

18. Repeat the ANOVA and Chi-square test sections for 2 other terms.

## Power Analysis Extension

19. Conduct a power analysis for the question in #7 to determine how many verses are needed to detect the specified effect.
20. Simulate the answer.
    <ol type="a">
    <li>Simulate n verses with the proportions found in #8 for Synoptic Gospels and John. Do a test to see if significantly different and save this result.</li>
    <li>Repeat step a many times.</li>
    <li>Calculate power: power = (number of significant findings) / (total simulations)</li>
    <li>Repeat steps a-c for a different n.</li>
    <li>Plot power as a function of number of verses.</li>
    <li>At what number of verses is the power at least 80%? That is, how many verses do you need per text to have an 80% chance of detecting the effect if it is really there?</li>
    </ol>
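
The simulation steps above can be sketched as follows, using only the standard library. The proportions `P_SYNOPTIC` and `P_JOHN`, the candidate sample sizes, and the choice of a pooled two-proportion z-test are all assumptions for illustration; the real values would come from the analysis in the Proportion section.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided pooled z-test for equality of two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(p1, p2, n, n_sims=500, alpha=0.05, rng=None):
    """Fraction of simulated datasets (n verses per text) with p < alpha."""
    if rng is None:
        rng = random.Random(0)
    significant = 0
    for _ in range(n_sims):
        # Steps a and b: draw n verses per text and test for a difference.
        x1 = sum(rng.random() < p1 for _ in range(n))
        x2 = sum(rng.random() < p2 for _ in range(n))
        if two_proportion_p_value(x1, n, x2, n) < alpha:
            significant += 1
    # Step c: power = significant findings / total simulations.
    return significant / n_sims


# Hypothetical term-presence proportions (not taken from the corpus).
P_SYNOPTIC, P_JOHN = 0.10, 0.15

rng = random.Random(42)
# Step d: repeat for several values of n.
power_by_n = {n: simulated_power(P_SYNOPTIC, P_JOHN, n, rng=rng) for n in (100, 400, 1000)}
for n, power in power_by_n.items():
    print(f"n = {n:5d}  power = {power:.2f}")
```

Plotting `power_by_n` (for example with matplotlib) covers step e, and the smallest n whose estimated power reaches 0.80 answers step f. A finer grid of n values and more simulations per n would tighten that estimate.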

## Sliding Windows Stretch
21. Repeat ANOVA and Chi-square sections on the same three theological terms, but instead of using verse boundaries, use sliding windows (+/- n tokens)
22. How do results of verse boundaries compare to using sliding windows?
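
One possible way to define co-occurrence with sliding windows instead of verse boundaries is sketched below. The helper `window_cooccurrence` is hypothetical, and the lemma sequence is a short toy list loosely following the opening of John 1:1. It assumes the tokens are already lemmatized and in reading order.

```python
def window_cooccurrence(tokens, target, companion, n):
    """Count target occurrences that have `companion` within +/- n tokens.

    Returns (hits, total), where total is the number of target occurrences.
    """
    hits = 0
    total = 0
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        total += 1
        # Window of up to n tokens on each side, excluding the target itself.
        window = tokens[max(0, i - n):i] + tokens[i + 1:i + n + 1]
        if companion in window:
            hits += 1
    return hits, total


# Toy lemma sequence for illustration only.
lemmas = ["ἐν", "ἀρχή", "εἰμί", "ὁ", "λόγος", "καί",
          "ὁ", "λόγος", "εἰμί", "πρός", "ὁ", "θεός"]

for n in (2, 4):
    hits, total = window_cooccurrence(lemmas, "λόγος", "θεός", n)
    print(f"n = {n}: {hits} of {total} occurrences of λόγος have θεός in the window")
```

Unlike verse-based counts, these windows cross verse and sentence boundaries and overlap one another, so neighbouring observations are not independent. That is worth keeping in mind when comparing the resulting chi-square statistics with the verse-based ones.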

## **Problem 1**: Identify 5 theologically salient lemmas from the top 20 lemmas (end of PS1) and compute per verse rates by book.

In [3]:
import pandas as pd

new_testament = pd.read_pickle('../Data/nt.pickle')
top20 = pd.read_pickle("../Data/zipf-top20.pickle")
dictionary = pd.read_pickle("../Data/dictionary.pickle")

##### Finding "salient" top 20 words (i.e., verbs, nouns, adjectives, adverbs)

In [4]:
def part_of_speech(parsed_rmac: str) -> str:
    # The first whitespace-delimited token of the parsed RMAC string is the part of speech.
    return parsed_rmac.split()[0]

new_testament['part_speech'] = new_testament['parsed_rmac'].apply(part_of_speech)

verbs = new_testament[new_testament["part_speech"] == "Verb"]
nouns = new_testament[new_testament["part_speech"] == "Noun"]
adjectives = new_testament[new_testament["part_speech"] == "Adjective"]
adverbs = new_testament[new_testament["part_speech"] == "Adverb"]

# Note: adverbs are selected above but are not included in the concatenation below.
salient_nt = pd.concat(objs=[verbs, nouns, adjectives])

salient_top_20 = salient_nt['strong_definition'].value_counts()[:20]
print(type(salient_top_20))
print(*salient_top_20.keys(), sep="\n\n")


<class 'pandas.core.series.Series'>
εἰμί the first person singular present indicative; a prolonged form of a primary and defective verb; I exist (used only when emphatic): am, have been, X it is I, was. See also εἶ, εἴην, εἶναι, εἷς καθ’ εἷς, ἦν, ἔσομαι, ἐσμέν, ἐστέ, εἰμί0, εἰμί1, εἰμί2, εἰμί3.

λέγω a primary verb; properly, to "lay" forth, i.e. (figuratively) relate (in words (usually of systematic or set discourse; whereas ἔπω and φημί generally refer to an individual expression or speech respectively; while ῥέω is properly to break silence merely, and λαλέω means an extended or random harangue)); by implication, to mean: ask, bid, boast, call, describe, give out, name, put forth, say(-ing, on), shew, speak, tell, utter.

θεός of uncertain affinity; a deity, especially (with ὁ) the supreme Divinity; figuratively, a magistrate; by Hebraism, very: X exceeding, God, god(-ly, -ward).

πᾶς including all the forms of declension; apparently a primary word; all, any, every, the whole: all (

The five selected lemmas:

1. κύριος from kuros (supremacy); supreme in authority
2. υἱός apparently a primary word; a "son"
3. Ἰησοῦς of Hebrew origin (יְהוֹשׁ֫וּעַ); Jesus (i.e. Jehoshua)
4. εἰμί "I exist" (used only when emphatic)
5. πατήρ apparently a primary word; a "father"

##### Finding the "per verse" rates by book.

In [None]:
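salient_words = ["κύριος", "υἱός", "Ἰησοῦς", "εἰμί", "πατήρ"]

# Match on the lemma rather than the inflected surface form stored in 'word'.
# Assumption: the lemma is the first token of 'strong_definition'.
new_testament["lemma"] = new_testament["strong_definition"].str.split().str[0].str.rstrip(",")
salient_nt = new_testament[new_testament["lemma"].isin(salient_words)]

# Occurrences of each lemma in each book, as a one-column DataFrame.
occurrences_per_book = salient_nt.groupby(by=["lemma", "book"]).size().to_frame(name="occurrences")

# Per-verse rate: occurrences divided by the number of distinct verses in the book.
verses_per_book = new_testament.drop_duplicates(["book", "chapter", "verse"]).groupby("book").size()
occurrences_per_book = occurrences_per_book.reset_index()
occurrences_per_book["per_verse_rate"] = occurrences_per_book["occurrences"] / occurrences_per_book["book"].map(verses_per_book)
display(occurrences_per_book)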


## **Problem 2**: Are the per-verse rates for the Synoptic Gospels and John the same for all 5 lemmas?

In [39]:
synoptics = new_testament.query("book in [40,41,42]")
john = new_testament[new_testament["book"] == 43]

display(synoptics.head())
display(john.head())

| Unnamed: 0 | word | rmac | strong | book | chapter | verse | parsed_rmac | strong_definition | part_speech |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Βίβλος | N-NSF | 976 | 40 | 1 | 1 | Noun - Nominative - Singular - Feminine | βίβλος properly, the inner bark of the papyrus... | Noun |
| 1 | γενέσεως | N-GSF | 1078 | 40 | 1 | 1 | Noun - Genitive - Singular - Feminine | γένεσις from the same as γενεά; nativity; figu... | Noun |
| 2 | Ἰησοῦ | N-GSM | 2424 | 40 | 1 | 1 | Noun - Genitive - Singular - Masculine | Ἰησοῦς of Hebrew origin (יְהוֹשׁ֫וּעַ); Jesus (i.... | Noun |
| 3 | Χριστοῦ | N-GSM | 5547 | 40 | 1 | 1 | Noun - Genitive - Singular - Masculine | Χριστός from χρίω; anointed, i.e. the Messiah,... | Noun |
| 4 | υἱοῦ | N-GSM | 5207 | 40 | 1 | 1 | Noun - Genitive - Singular - Masculine | υἱός apparently a primary word; a "son" (somet... | Noun |

| Unnamed: 0 | word | rmac | strong | book | chapter | verse | parsed_rmac | strong_definition | part_speech |
|---|---|---|---|---|---|---|---|---|---|
| 48927 | Ἐν | PREP | 1722 | 43 | 1 | 1 | Preposition | ἐν a primary preposition denoting (fixed) posi... | Preposition |
| 48928 | ἀρχῇ | N-DSF | 746 | 43 | 1 | 1 | Noun - Dative - Singular - Feminine | ἀρχή from ἄρχομαι; (properly abstract) a comme... | Noun |
| 48929 | ἦν | V-IAI-3S | 1510 | 43 | 1 | 1 | Verb - Imperfect - Active - Indicative - 3rd -... | εἰμί the first person singular present indicat... | Verb |
| 48930 | ὁ | T-NSM | 3588 | 43 | 1 | 1 | Definite article - Nominative - Singular - Mas... | ὁ, including the feminine he, and the neuter t... | Definite |
| 48931 | λόγος, | N-NSM | 3056 | 43 | 1 | 1 | Noun - Nominative - Singular - Masculine | λόγος from λέγω; something said (including the... | Noun |
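
One way to turn token tables like these into per-verse counts for a single lemma is sketched below. It uses a small toy DataFrame with the same key columns as `new_testament`; the `lemma` column and the helper `per_verse_counts` are assumptions for illustration. The point of the sketch is that verses with zero occurrences have to be kept, otherwise the per-verse rate is overstated.

```python
import pandas as pd

# Toy token table mirroring the key columns of `new_testament`.
# The `lemma` column is an assumption; the real data may need it derived first.
tokens = pd.DataFrame({
    "book":    [40, 40, 40, 40, 43, 43, 43],
    "chapter": [1, 1, 1, 1, 1, 1, 1],
    "verse":   [1, 1, 2, 3, 1, 1, 2],
    "lemma":   ["υἱός", "υἱός", "θεός", "εἰμί", "εἰμί", "λόγος", "εἰμί"],
})


def per_verse_counts(df, lemma):
    """Occurrences of `lemma` in every verse, including verses with zero."""
    keys = ["book", "chapter", "verse"]
    all_verses = pd.MultiIndex.from_frame(df[keys].drop_duplicates())
    counts = df[df["lemma"] == lemma].groupby(keys).size()
    # Reindex so verses without the lemma appear with a count of 0.
    return counts.reindex(all_verses, fill_value=0)


counts = per_verse_counts(tokens, "εἰμί")
is_john = counts.index.get_level_values("book") == 43

synoptic_rate = counts[~is_john].mean()
john_rate = counts[is_john].mean()
print(f"Synoptics: {synoptic_rate:.3f} per verse, John: {john_rate:.3f} per verse")
```

The two arrays `counts[~is_john]` and `counts[is_john]` are the verse-level samples that a two-sample test (for example Welch's t-test via `scipy.stats.ttest_ind(..., equal_var=False)`) would compare for each lemma.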
