# Statistical Tests

Name: **Isaac Anderson**

Date: **16 Sept 2025**

## Hypothesis Tests

### Means
1. Identify 5 theologically salient lemmas from the top 20 lemmas (end of PS1) and compute per-verse rates by book.
2. Are the per-verse rates in the Synoptic Gospels and John the same for all 5 lemmas?
3. Report effect sizes using Cohen's d along with confidence intervals.
4. Correct these results for multiple tests using the Bonferroni correction.
5. Correct these results for multiple tests using the Benjamini-Hochberg correction.
6. Why are the numbers from the Bonferroni correction different from the Benjamini-Hochberg correction? Which results should you use and why?
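
As a rough illustration of items 4 to 6, the sketch below applies both corrections to a list of p-values using only the standard library. The p-values are made up for the example and are not results from this notebook. Bonferroni controls the family-wise error rate by multiplying every p-value by the number of tests, while Benjamini-Hochberg controls the false discovery rate by scaling each p-value according to its rank, so its adjusted values are never larger than Bonferroni's.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure, FDR control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_raw = [0.001, 0.008, 0.039, 0.041, 0.27]  # made-up p-values, one per lemma
print("Bonferroni:", bonferroni(p_raw))
print("Benjamini-Hochberg:", benjamini_hochberg(p_raw))
```

With five tests, a raw p-value of 0.039 survives neither correction at 0.05 in this toy input, but it comes much closer under Benjamini-Hochberg than under Bonferroni.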

### Proportion
7. Is the proportion of verses with the term present vs. absent the same between the Synoptic Gospels and John for all 5 lemmas?
8. Report effect sizes using Cohen's d along with confidence intervals.
9. Correct these results for multiple tests using the Bonferroni correction.
10. Correct these results for multiple tests using the Benjamini-Hochberg correction.
11. Why are the numbers from the Bonferroni correction different from the Benjamini-Hochberg correction? Which results should you use and why?
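
One common way to approach item 7 is a pooled two-proportion z-test on the number of verses that contain the term in each group. The sketch below is a standard-library version of that test; the function name and the counts in the usage line are illustrative assumptions, not values from this notebook.

```python
import math


def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    x1, x2: verses containing the term in each group
    n1, n2: total verses in each group
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: term present in 30 of 100 verses vs. 10 of 100 verses.
z, p = two_proportion_ztest(30, 100, 10, 100)
print(f"z = {z:.3f}, p = {p:.5f}")
```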

### ANOVA
12. Run a term-rate ANOVA on a theologically salient lemma from the top 20 lemmas across the following groups: {Matthew and Mark, Luke and Acts, Johannine books, Pauline books, all other books}.
13. Perform a post-hoc Tukey HSD test to see which pairs differ.
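
To make the ANOVA step concrete, here is a minimal standard-library sketch of the one-way F statistic, computed from between-group and within-group sums of squares. It stops at the F statistic and degrees of freedom; the p-value and the Tukey HSD comparisons need the F and studentized range distributions, which a statistics library would supply. The group values are made up.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up per-verse term counts for three groups of books.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(toy_groups))
```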

### Chi-square Test
14. Build a 2xk contingency table for collocation: presence of target term vs top-k companion terms within verse.
15. Run Chi-square tests of independence; compute standardized residuals to find surprising co-occurrences.
16. Visualize a heatmap of residuals.
17. Explain results.
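
The sketch below shows the arithmetic behind items 14 and 15 on a small made-up 2xk table. It computes expected counts under independence and Pearson residuals, (observed - expected) / sqrt(expected), whose squares sum to the Chi-square statistic. Note the assumption: "standardized residuals" can also mean adjusted residuals, which divide by a further row and column correction; this example uses the simpler Pearson form.

```python
def chi_square_residuals(table):
    """Return (chi2, dof, residuals) for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [
        [(o - e) / e ** 0.5 for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    chi2 = sum(res ** 2 for row in residuals for res in row)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, residuals


# Rows: target term present / absent; columns: three companion terms (made up).
toy_table = [[10, 20, 30], [20, 20, 20]]
chi2, dof, residuals = chi_square_residuals(toy_table)
print(chi2, dof)
for row in residuals:
    print([round(r, 3) for r in row])
```

Cells with large positive residuals co-occur more often than independence predicts, and those are the cells a residual heatmap would highlight.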

18. Repeat the ANOVA and Chi-square test sections for 2 other terms.

## Power Analysis Extension

19. Conduct a power analysis for the question in #7 to determine how many verses are needed to detect the specified effect.
20. Simulate the answer.
    <ol type="a">
    <li>Simulate n verses with the proportions found in #8 for Synoptic Gospels and John. Test whether the two simulated proportions differ significantly and save the result.</li>
    <li>Repeat step a many times.</li>
    <li>Calculate power: power = (number of significant findings) / (total simulations)</li>
    <li>Repeat steps a-c for a different n.</li>
    <li>Plot power as a function of number of verses.</li>
    <li>At what number of verses does the power reach at least 80%? That is the number of verses you need per text to have an 80% chance of detecting the effect if it is really there.</li>
    </ol>
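
Steps a through e above can be sketched with the standard library alone. The proportions 0.10 and 0.20, the sample sizes, the number of simulations, and the choice of a pooled two-proportion z-test are all assumptions made for illustration; substitute the proportions actually observed for the Synoptic Gospels and John.

```python
import math
import random


def is_significant(x1, n1, x2, n2, alpha=0.05):
    """Two-sided pooled two-proportion z-test; True if p < alpha."""
    pooled = (x1 + x2) / (n1 + n2)
    if pooled in (0.0, 1.0):  # no variation at all, nothing to test
        return False
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2)) < alpha


def simulated_power(p1, p2, n, n_sims=500, seed=0):
    """Fraction of simulations in which the difference is detected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Step a: simulate n verses per group, each containing the term or not.
        x1 = sum(rng.random() < p1 for _ in range(n))
        x2 = sum(rng.random() < p2 for _ in range(n))
        hits += is_significant(x1, n, x2, n)
    # Step c: power = significant findings / total simulations.
    return hits / n_sims


# Step d: repeat for several n; these values could then be plotted (step e).
for n in (50, 100, 200, 400):
    print(n, simulated_power(0.10, 0.20, n))
```

The smallest n whose simulated power is at least 0.80 answers step f for the assumed proportions.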

## Sliding Windows Stretch
21. Repeat the ANOVA and Chi-square sections on the same three theological terms, but instead of using verse boundaries, use sliding windows (+/- n tokens).
22. How do results of verse boundaries compare to using sliding windows?
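
A sliding window of +/- n tokens can be built by taking, for each occurrence of the target, the n tokens before it and the n tokens after it, regardless of verse boundaries. The helper below is a minimal sketch of that idea; the function name and the toy token list are hypothetical.

```python
def windows_around(tokens, target, n):
    """For each occurrence of target, return the up-to-n tokens on each side."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - n)  # clamp at the start of the text
            # Slicing past the end of a list is safe in Python, so no upper clamp.
            windows.append(tokens[lo:i] + tokens[i + 1:i + 1 + n])
    return windows


toy_tokens = ["a", "b", "T", "c", "d", "T"]
print(windows_around(toy_tokens, "T", 2))
```

Companion-term counts for the Chi-square table can then be tallied over these windows instead of over verses.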

## **Problem 1**: Identify 5 theologically salient lemmas from the top 20 lemmas (end of PS1) and compute per verse rates by book.

In [23]:
import pandas as pd

new_testament = pd.read_pickle('../Data/nt.pickle')
top20 = pd.read_pickle("../Data/zipf-top20.pickle")
strongs_dictionary = pd.read_pickle("./pickles/ps1/strongs_dictionary")

In [24]:
print(strongs_dictionary.columns)

Index(['greek-word', 'kjv-def'], dtype='object')


##### Finding "salient" top 20 words (i.e., verbs, nouns, adjectives, adverbs)

In [30]:
new_testament = pd.merge(
    strongs_dictionary,
    new_testament, 
    left_index=True, 
    right_on="strong"
)


In [None]:
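def part_of_speech(parsed_rmac: str) -> str:
    # The first whitespace-separated token of the parsed RMAC string is the part of speech (e.g. "Verb").
    return parsed_rmac.split()[0]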

new_testament['part_speech'] = new_testament['parsed_rmac'].apply(part_of_speech)

verbs  = new_testament[new_testament["part_speech"] == "Verb"]
nouns = new_testament[new_testament["part_speech"] == "Noun"]
adjectives = new_testament[new_testament["part_speech"] == "Adjective"]
adverbs = new_testament[new_testament["part_speech"] == "Adverb"]

salient_nt = pd.concat(objs=[verbs, nouns, adjectives, adverbs])
print(salient_nt)
print(salient_nt['strong'].value_counts()[""])


             word      rmac  strong  book  chapter  verse  \
9       ἐγέννησεν  V-AAI-3S    1080    40        1      2   
14      ἐγέννησεν  V-AAI-3S    1080    40        1      2   
19      ἐγέννησεν  V-AAI-3S    1080    40        1      2   
28      ἐγέννησεν  V-AAI-3S    1080    40        1      3   
39      ἐγέννησεν  V-AAI-3S    1080    40        1      3   
...           ...       ...     ...   ...      ...    ...   
137319      ταχύ,       ADV    5035    66       22     12   
137328         ὡς       ADV    5613    66       22     12   
137372        ἔξω       ADV    1854    66       22     15   
137441    δωρεάν.       ADV    1432    66       22     17   
137513      ταχύ.       ADV    5035    66       22     20   

                                              parsed_rmac  \
9       Verb - Aorist - Active - Indicative - 3rd - Si...   
14      Verb - Aorist - Active - Indicative - 3rd - Si...   
19      Verb - Aorist - Active - Indicative - 3rd - Si...   
28      Verb - Aorist -

1. κύριος from kuros (supremacy); supreme in authority
2. υἱός apparently a primary word; a "son"
3. Ἰησοῦς of Hebrew origin (יְהוֹשׁ֫וּעַ); Jesus (i.e. Jehoshua)
4. εἰμί the first person singular present indicative; "I exist" (used only when emphatic)
5. πατήρ apparently a primary word; a "father"

##### Finding the "per verse" rates by book.

In [None]:
salient_words = ["κύριος", "υἱός", "Ἰησοῦς", "εἰμί", "πατήρ"]
# NOTE: this matches the 'word' column exactly, so only tokens spelled like the
# lemma itself (no other inflection, no attached punctuation) are counted.
salient_nt = new_testament[new_testament['word'].isin(salient_words)]
print(salient_nt['word'].value_counts())

# Collect occurrences of words per book
words_in_each_gospel = salient_nt.pivot_table(
    index="book",
    columns="word",
    aggfunc="size",
)

# Estimate the number of verses per book: take the highest verse number seen in
# each chapter, then sum over chapters.
# NOTE: this is computed from the filtered rows, so it can undercount verses.
verses_per_chapter = salient_nt.groupby(['book','chapter'])['verse'].max()
verses_per_book = verses_per_chapter.groupby('book').sum()

per_verse_df = pd.merge(verses_per_book, words_in_each_gospel, on="book")
per_verse_df.fillna(0, inplace=True)
# εἰμί has no exact-match tokens above, so it is absent from the pivot columns.
per_verse_df = per_verse_df[["κύριος","πατήρ", "υἱός", "Ἰησοῦς"]].div(per_verse_df['verse'], axis=0)
print(per_verse_df)




word
Ἰησοῦς    314
κύριος    128
υἱός       21
πατήρ      19
Name: count, dtype: int64
        κύριος     πατήρ      υἱός    Ἰησοῦς
book                                        
40    0.023316  0.007772  0.003886  0.110104
41    0.018692  0.000000  0.004673  0.109813
42    0.029021  0.003628  0.009674  0.051995
43    0.001422  0.014225  0.004267  0.152205
44    0.055556  0.000000  0.003472  0.027778
45    0.057471  0.000000  0.000000  0.011494
46    0.085714  0.000000  0.000000  0.028571
47    0.060976  0.000000  0.000000  0.024390
48    0.500000  0.000000  0.000000  0.500000
50    0.125000  0.000000  0.000000  0.062500
51    0.000000  0.000000  0.000000  0.090909
52    0.142857  0.000000  0.000000  0.071429
53    0.156250  0.000000  0.000000  0.062500
54    0.032258  0.000000  0.000000  0.064516
55    0.135593  0.000000  0.000000  0.000000
58    0.052632  0.000000  0.026316  0.026316
59    0.076923  0.000000  0.000000  0.000000
60    0.000000  0.000000  0.076923  0.000000
61    0.08571
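
Because the `word` column holds inflected surface forms (for example Ἰησοῦ rather than Ἰησοῦς), one alternative is to count a lemma by its Strong's number and to count verses from the full token table. The sketch below shows that idea on a tiny made-up table that reuses the column names `strong`, `book`, `chapter`, and `verse`; it assumes `strong` is stored as an integer, which may not hold for the real data. The Strong's numbers 2424 (Ἰησοῦς) and 5207 (υἱός) are the ones shown in the token table output in this notebook.

```python
import pandas as pd

# Toy token table; the values are made up, only the column names mirror new_testament.
tokens = pd.DataFrame({
    "strong":  [2424, 5207, 2424, 976, 2424, 5207],
    "book":    [40,   40,   40,   40,  43,   43],
    "chapter": [1,    1,    1,    1,   1,    1],
    "verse":   [1,    1,    2,    2,   1,    2],
})
target_strongs = {2424: "Ἰησοῦς", 5207: "υἱός"}

# Distinct (chapter, verse) pairs per book, counted on the full table.
verses_per_book = (
    tokens.drop_duplicates(["book", "chapter", "verse"]).groupby("book").size()
)

# Occurrences per book of each target lemma, matched by Strong's number.
hits = tokens[tokens["strong"].isin(list(target_strongs))]
counts = (
    hits.groupby(["book", "strong"]).size().unstack(fill_value=0)
    .rename(columns=target_strongs)
)

# Per-verse rate: occurrences divided by the number of verses in the book.
rates = counts.div(verses_per_book, axis=0)
print(rates)
```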

## **Problem 2**: Are the per-verse rates in the Synoptic Gospels and John the same for all 5 lemmas?

In [None]:
synoptics = new_testament.query("book in [40,41,42]")
john = new_testament[new_testament["book"] == 43]

display(synoptics.head())
display(john.head())

| | word | rmac | strong | book | chapter | verse | parsed_rmac | strong_definition | part_speech |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Βίβλος | N-NSF | 976 | 40 | 1 | 1 | Noun - Nominative - Singular - Feminine | βίβλος properly, the inner bark of the papyrus... | Noun |
| 1 | γενέσεως | N-GSF | 1078 | 40 | 1 | 1 | Noun - Genitive - Singular - Feminine | γένεσις from the same as γενεά; nativity; figu... | Noun |
| 2 | Ἰησοῦ | N-GSM | 2424 | 40 | 1 | 1 | Noun - Genitive - Singular - Masculine | Ἰησοῦς of Hebrew origin (יְהוֹשׁ֫וּעַ); Jesus (i.... | Noun |
| 3 | Χριστοῦ | N-GSM | 5547 | 40 | 1 | 1 | Noun - Genitive - Singular - Masculine | Χριστός from χρίω; anointed, i.e. the Messiah,... | Noun |
| 4 | υἱοῦ | N-GSM | 5207 | 40 | 1 | 1 | Noun - Genitive - Singular - Masculine | υἱός apparently a primary word; a "son" (somet... | Noun |


| | word | rmac | strong | book | chapter | verse | parsed_rmac | strong_definition | part_speech |
|---|---|---|---|---|---|---|---|---|---|
| 48927 | Ἐν | PREP | 1722 | 43 | 1 | 1 | Preposition | ἐν a primary preposition denoting (fixed) posi... | Preposition |
| 48928 | ἀρχῇ | N-DSF | 746 | 43 | 1 | 1 | Noun - Dative - Singular - Feminine | ἀρχή from ἄρχομαι; (properly abstract) a comme... | Noun |
| 48929 | ἦν | V-IAI-3S | 1510 | 43 | 1 | 1 | Verb - Imperfect - Active - Indicative - 3rd -... | εἰμί the first person singular present indicat... | Verb |
| 48930 | ὁ | T-NSM | 3588 | 43 | 1 | 1 | Definite article - Nominative - Singular - Mas... | ὁ, including the feminine he, and the neuter t... | Definite |
| 48931 | λόγος, | N-NSM | 3056 | 43 | 1 | 1 | Noun - Nominative - Singular - Masculine | λόγος from λέγω; something said (including the... | Noun |
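
Once the Synoptic and John subsets exist, one way to quantify the difference for a lemma is Cohen's d on the two sets of rates, with an approximate confidence interval. The sketch below uses the pooled standard deviation and a common large-sample approximation for the standard error of d, sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))), with a normal critical value of 1.96. The input lists are made up, and the choice of this approximation is an assumption, not the notebook's stated method.

```python
import math
from statistics import mean, variance


def cohens_d_with_ci(a, b, z_crit=1.96):
    """Cohen's d (pooled SD) with an approximate normal-theory interval."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    d = (mean(a) - mean(b)) / math.sqrt(pooled_var)
    # Large-sample approximation to the standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, (d - z_crit * se, d + z_crit * se)


# Made-up rates for two groups.
group_a = [2, 4, 6]
group_b = [1, 3, 5]
d, (ci_low, ci_high) = cohens_d_with_ci(group_a, group_b)
print(f"d = {d:.3f}, 95% CI ~ ({ci_low:.3f}, {ci_high:.3f})")
```

With samples this small the interval is very wide, which is itself a useful reminder to report the interval alongside d.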
