# Exploring OpenBible.info Data

This dataset by Stephen Smith is a collection of topics and associated verses, begun in 2007. The initial set was about 4000 topics based on completing the phrase "What does the Bible say about...". After that:

> I used the Yahoo Web Search API to get the top thirty webpages related to each topic and then extracted the verse references from each page.

(This approach likely included some irrelevant verses: in theory the voting approach should mitigate this.)

> This Bible is a mashup of the Yahoo! and ESV Bible web services. It searches the Internet for the topics that interest people, many of which you’d never find in a traditional topical Bible. Then it shows relevant verses.

Because of the methodology for adding topics, this data might be most useful if combined with other, more curated topic inventories.

On the resulting site, users are invited to:
- vote on the relevance of the verse to the topics
- suggest other verses for a topic
- suggest new topics

> Since launching three weeks ago, people have voted up or down 3,000 verses and suggested 200 new verses, in addition to creating 500 new topics.

The original site had a passage -> tag cloud feature that no longer appears to function.

## Scoring

The initial scores:

> each page got one vote per unique verse—so two references to John 1:1 on the same page would only count as one vote. All verses that appeared on two or more webpages made it into the main TB index.

(This suggests that any topic/passage pair which still has only two votes might be discardable.)

> About 750 of the topics occurred in both the new TB and in Nave’s; every verse for each topic in Nave’s got an extra three votes in the new TB.
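
The quoted scoring rules can be sketched as follows. This is only an illustration of the description above, not the site's actual code: the function name, the input shapes, and the choice to admit Nave's verses that no web page mentioned are all assumptions.

```python
from collections import Counter

NAVES_BONUS = 3  # extra votes for verses Nave's lists under the same topic
MIN_PAGES = 2    # a verse must appear on at least this many pages


def initial_votes(pages, naves_verses=frozenset()):
    """Return {verse: votes} for one topic (hypothetical helper).

    pages: one list of verse references per web page.
    naves_verses: verses Nave's lists for this topic (empty if not in Nave's).
    """
    votes = Counter()
    for refs in pages:
        # set() collapses repeated references on one page into a single vote
        votes.update(set(refs))
    # keep only verses that appeared on two or more pages
    scored = {verse: n for verse, n in votes.items() if n >= MIN_PAGES}
    # Assumption: the Nave's bonus also admits verses the web pages missed
    for verse in naves_verses:
        scored[verse] = scored.get(verse, 0) + NAVES_BONUS
    return scored


pages = [
    ["John 1:1", "John 1:1", "Gen 1:1"],  # the repeated John 1:1 counts once
    ["John 1:1", "Gen 1:1"],
    ["John 1:1", "Ps 33:6"],              # Ps 33:6 is on one page only
]
result = initial_votes(pages, naves_verses={"Gen 1:1"})
print(result)
```

In this toy input, Ps 33:6 is dropped because only one page cites it, and Gen 1:1 ends up ahead of John 1:1 because of the Nave's bonus.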

* Initial release: [June « 2007 « OpenBible.info Blog](https://www.openbible.info/blog/2007/06/)
* [Topical Bible Technical Notes « OpenBible.info Blog](https://www.openbible.info/blog/2007/07/topical-bible-technical-notes/)
* Other blog posts in the Topics category: https://www.openbible.info/blog/category/topics/. These include some interesting change-over-time analysis for some hot-button topics.

## Duplicate Topics

> Searching for a word will automatically add it.

This means there's some duplication that should probably be collapsed, e.g. 

```
Tatoos On The Body
...
Tattoo
Tattooing
Tattooing Your Body
Tattoos
Tattoos And Body Piercings
Tattoos And Piercings
Tattoos Body Piercings
```

## Updates

Note this data is still updated weekly: this snapshot is from 2024-08-05. It might be interesting to compare against previous versions to see whether the data is still growing, and how. For example, the letter T currently has 734 entries: it had the same number of entries in 2014.

The analysis below shows:
* There are now about 6,700 unique topics (6,713 in this snapshot), so some de-duping and consolidation is probably needed.
* There is a "fat head" of topics whose vote counts are far above the mean.

In [1]:
# Setup
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

from src.openbibleinfo import reader
rd = reader.Reader()

In [2]:
# some passages are ranges (EXO 20:1-26), others a single verse (Gal 5:14)
rd.df.head()

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes,PassageLength,UsableRange
0,10 commandments,2020001,2020026.0,291,26,True
1,10 commandments,48005014,,140,1,False
2,10 commandments,45013008,45013010.0,114,3,True
3,10 commandments,5004013,,101,1,False
4,10 commandments,2034028,,93,1,False


In [3]:
n_topics = len(rd.df.Topic.value_counts())
n_records = len(rd.df)
n_votes = rd.df.Votes.sum()
print(f"Number of uniq topics: \t{n_topics:9}")
print(f"Number of rows: \t{n_records:9}")
print(f"Number of votes: \t{n_votes:9}")

Number of uniq topics: 	     6713
Number of rows: 	   465956
Number of votes: 	 73692178


## Expanding the Data: Passage Length

Some passages have an `EndVerseId` value, indicating a range. The reader derives a `PassageLength` column from it:
* The value is 1 if there is no `EndVerseId`.
* If the `EndVerseId` is in a different chapter, arbitrarily return 99. My code can't currently enumerate these verses, and the ranges are potentially too long anyway.
* If the `EndVerseId` is in a different book, return 999. These ranges seem far too long to be useful. They appear under 72 topics, some of them hot-button, and carry about 0.005% of all votes.
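
A minimal sketch of these rules, assuming verse identifiers encode book, chapter, and verse as `BBCCCVVV`. That layout is inferred from the sample rows (EXO 20:1-26 appears as 2020001 to 2020026, Gal 5:14 as 48005014). The function names are hypothetical, and the real logic in `reader.Reader` may differ.

```python
import math


def split_verse_id(verse_id):
    """Split a BBCCCVVV identifier into (book, chapter, verse)."""
    book, rest = divmod(int(verse_id), 1_000_000)
    chapter, verse = divmod(rest, 1_000)
    return book, chapter, verse


def passage_length(start_id, end_id=None):
    """Return the number of verses in a passage, or a sentinel (99 / 999)."""
    # a missing end verse (None or NaN) means a single verse
    if end_id is None or (isinstance(end_id, float) and math.isnan(end_id)):
        return 1
    start_book, start_chapter, start_verse = split_verse_id(start_id)
    end_book, end_chapter, end_verse = split_verse_id(end_id)
    if start_book != end_book:
        return 999  # cross-book range
    if start_chapter != end_chapter:
        return 99   # cross-chapter range: not enumerated
    return end_verse - start_verse + 1


print(passage_length(2020001, 2020026.0))       # EXO 20:1-26
print(passage_length(48005014, float("nan")))   # Gal 5:14, no end verse
print(passage_length(45013008, 45013010))       # ROM 13:8-10
```

The three calls mirror the first rows of the `rd.df.head()` output above, where `PassageLength` is 26, 1, and 3.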


In [4]:
# cross-chapter ranges
ccrecords = rd.df[rd.df.PassageLength==99]
print(f"{len(ccrecords)} cross-chapter records")
print(f"{(ccrecords.Votes.sum() / n_votes)*100:.2f}% of all votes")
ccrecords.Topic.value_counts()

8373 cross-chapter records
0.95% of all votes


Topic
islam               131
being single        107
hate                101
abortion             96
sports               90
                   ... 
annoying people       1
ankh                  1
animal cruelty        1
anger management      1
666                   1
Name: count, Length: 1235, dtype: Int64

In [5]:
# cross-book ranges: these will likely get dropped
cbrecords = rd.df[rd.df.PassageLength==999]
print(f"{(cbrecords.Votes.sum() / n_votes)*100:.3f}% of all votes")
cbrecords.Topic.value_counts()

0.005% of all votes


Topic
islam                    7
being single             6
abortion                 6
christmas                5
video games              3
                        ..
violence                 1
watching tv              1
atheists                 1
losing your salvation    1
yoga                     1
Name: count, Length: 72, dtype: Int64

In [6]:
# topicsdf is meant to keep only passages whose PassageLength < 99
n_topics = len(rd.topicsdf.Topic.value_counts())
n_records = len(rd.topicsdf)
n_votes = rd.topicsdf.Votes.sum()
print(f"Number of uniq topics: \t{n_topics:9}")
print(f"Number of rows: \t{n_records:9}")
print(f"Number of votes: \t{n_votes:9}")

Number of uniq topics: 	     6713
Number of rows: 	   465956
Number of votes: 	 73692178


### Distribution of Passage Lengths

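This is meant to exclude cross-chapter and cross-book ranges, although the output below still lists the 99 sentinel at about 1.8% of rows.

Lengths of 5 or fewer appear to cover close to 90% of the data (lengths 1 through 4 alone account for about 86%). So perhaps enumerate ranges up to that length and split their votes, and otherwise treat the range as atomic (perhaps with a constant epsilon weight)?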
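
One way to read that idea, as a rough sketch: short ranges are enumerated verse by verse with the votes divided evenly, and anything longer stays a single row with a constant weight. The function name, the even split, and the default `epsilon` value are assumptions, and this does not settle the question of how ranges should be scored.

```python
def expand_passage(start_id, end_id, length, votes, max_len=5, epsilon=1.0):
    """Return (start_id, end_id, weight) rows for one passage (hypothetical helper).

    Ranges of up to max_len verses become one row per verse, each carrying an
    equal share of the votes. Longer ranges, including the 99 and 999
    sentinels, stay as one atomic row weighted by epsilon.
    """
    if length <= max_len:
        share = votes / length
        # verses in one chapter have consecutive ids, so adding k is enough
        return [(start_id + k, start_id + k, share) for k in range(length)]
    return [(start_id, end_id, epsilon)]


# ROM 13:8-10 with 114 votes: three rows of 38 votes each
print(expand_passage(45013008, 45013010, length=3, votes=114))
# EXO 20:1-26 with 291 votes: kept as a single low-weight row
print(expand_passage(2020001, 2020026, length=26, votes=291))
```

Dividing evenly keeps the total vote mass unchanged for short ranges, at the cost of possibly underweighting each verse in the range.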

In [7]:
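# compare against the full frame: n_records was reassigned to len(rd.topicsdf) in the previous cell
print(f"Records with a 'good' PassageLength value: {len(rd.topicsdf)} ({(len(rd.topicsdf)/len(rd.df))*100:.2f}%)")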
rd.topicsdf.PassageLength.value_counts(normalize=True)

Records with a 'good' PassageLength value: 465956 (100.00%)


PassageLength
1      0.709885
2      0.091275
3      0.041639
4      0.020180
99     0.017970
         ...   
107    0.000002
76     0.000002
70     0.000002
110    0.000002
79     0.000002
Name: proportion, Length: 82, dtype: float64

## "Top" Topics

A TopicRecord combines a Topic label, a set of passages, and a count of votes. 
* `TopicVotesSum` is the sum of votes for a topic
* `TopicPassageCount` is the count of passages for a topic
* `MeanPassageVotes` is `TopicVotesSum`/`TopicPassageCount`
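
The three fields can be computed with a plain pandas `groupby`. The sketch below is not the implementation of `rd.top_topics()`: the function name `topic_summary` and the "at or above threshold" rule are assumptions. The sample rows are a handful copied from the outputs in this notebook, so the totals are not the real topic totals.

```python
import pandas as pd


def topic_summary(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Aggregate passage rows into one row per topic (hypothetical helper)."""
    summary = (
        df.groupby("Topic")
        .agg(
            TopicPassageCount=("Votes", "size"),  # number of passage rows
            TopicVotesSum=("Votes", "sum"),       # total votes across passages
        )
        .reset_index()
    )
    summary["MeanPassageVotes"] = (
        summary["TopicVotesSum"] / summary["TopicPassageCount"]
    )
    # Assumption: topics at or above the threshold are kept
    keep = summary["MeanPassageVotes"] >= threshold
    return summary[keep].reset_index(drop=True)


sample = pd.DataFrame({
    "Topic": ["10 commandments"] * 5 + ["love one another"] * 5,
    "Votes": [291, 140, 114, 101, 93, 791, 532, 489, 441, 375],
})
summary = topic_summary(sample, threshold=55)
print(summary)
```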

In [15]:
# 55 is the 80th percentile of MeanPassageVotes across rd.topicsdf,
# so it serves as an approximation for "high number of votes per passage"
toptopics = rd.top_topics(threshold=55)
toptopics.sort_values('MeanPassageVotes', ascending=False).head(10)

Unnamed: 0,Topic,TopicPassageCount,TopicVotesSum,MeanPassageVotes
2785,helping others,597,829782,1389.919598
3027,immigration,367,432590,1178.719346
1812,eating pork,410,464953,1134.031707
5352,sodomy,480,481264,1002.633333
489,being born again,264,262186,993.128788
4529,power of words,761,711287,934.674113
1338,covenant,326,297725,913.266871
2866,homosexuality,323,293903,909.916409
5317,slavery,548,496198,905.470803
2577,gossip,424,379530,895.117925


In [16]:
# the median TopicPassageCount value is 161: that still seems like way too many
toptopics.describe()

Unnamed: 0,TopicPassageCount,TopicVotesSum,MeanPassageVotes
count,1355.0,1355.0,1355.0
mean,190.954982,49327.308487,179.582226
std,135.306835,85410.272764,176.034914
min,1.0,66.0,55.023364
25%,109.0,7811.0,66.862163
50%,161.0,15804.0,96.71223
75%,244.0,52222.0,220.242695
max,1023.0,829782.0,1389.919598


In [17]:
# we might further filter rd.topicsdf to the top n passages for each topic in toptopics.Topic
# we could also choose a higher threshold for top_topics()
rd.topicsdf[rd.topicsdf.Topic == "love one another"]

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes,PassageLength,UsableRange
251548,love one another,43013034,,791,1,False
251549,love one another,62004020,,532,1,False
251550,love one another,60004008,,489,1,False
251551,love one another,43013034,43013035,441,2,True
251552,love one another,45012010,,375,1,False
...,...,...,...,...,...,...
251695,love one another,46006007,,11,1,False
251696,love one another,20030018,20030019,10,2,True
251697,love one another,45014019,,10,1,False
251698,love one another,49002004,49002005,10,2,True
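
The "top n passages per topic" filter mentioned in the cell comment could look roughly like this. It is a sketch on a hand-built frame whose rows are copied from the outputs above; `top_passages` is a hypothetical helper, not a method of the reader.

```python
import pandas as pd


def top_passages(df: pd.DataFrame, topics, n: int = 10) -> pd.DataFrame:
    """Keep the n highest-voted passages for each of the given topics."""
    subset = df[df["Topic"].isin(topics)]
    # sort so that each topic's passages run from most to fewest votes
    ranked = subset.sort_values(["Topic", "Votes"], ascending=[True, False])
    # head(n) on a groupby keeps the first n rows of every group
    return ranked.groupby("Topic").head(n)


sample = pd.DataFrame({
    "Topic": ["love one another"] * 4 + ["10 commandments"] * 3,
    "StartVerseId": [43013034, 62004020, 60004008, 45012010,
                     2020001, 48005014, 45013008],
    "Votes": [791, 532, 489, 375, 291, 140, 114],
})
top2 = top_passages(sample, topics=["love one another"], n=2)
print(top2)
```

With the real data, `topics` would be `toptopics.Topic` and `df` would be `rd.topicsdf`.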


## Expanding the Data: Topic Overlap

Topics overlap due to:
* typos (tatoo vs tattoo)
* general/specific 
* synonymy

Approach for similar strings:
* set a threshold S for similarity
* Compare all pairs T1, T2
* For all pairs whose similarity > S, combine them
* Repeat, averaging similarity across members of each set TS1, TS2. Might need to relax S given the averaging strategy?

Once term sets are combined, combine their verse inventories, averaging their votes. Potentially discard verses with counts below a threshold? 
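
The approach above amounts to average-linkage agglomerative clustering, which can be sketched as below. To stay self-contained it uses the standard library's `difflib.SequenceMatcher` for string similarity rather than `thefuzz`. The threshold, the function names, and the rule for averaging votes (over the topics that list a verse) are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """String similarity between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def average_similarity(set1, set2):
    """Mean pairwise similarity between the members of two topic sets."""
    scores = [similarity(a, b) for a in set1 for b in set2]
    return sum(scores) / len(scores)


def cluster_topics(topics, threshold=0.8):
    """Repeatedly merge the two most similar sets while similarity > threshold."""
    clusters = [[t] for t in topics]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (average_similarity(c1, c2), (i, j))
            for (i, c1), (j, c2) in combinations(enumerate(clusters), 2)
        )
        if best_score <= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations() guarantees j > i
    return clusters


def merge_inventories(inventories, min_votes=0):
    """Combine {verse: votes} dicts, averaging over the topics listing a verse."""
    collected = {}
    for inventory in inventories:
        for verse, votes in inventory.items():
            collected.setdefault(verse, []).append(votes)
    averaged = {verse: sum(v) / len(v) for verse, v in collected.items()}
    # optionally discard verses whose averaged votes fall below a threshold
    return {verse: v for verse, v in averaged.items() if v >= min_votes}


clusters = cluster_topics(["tattoo", "tattoos", "yoga"], threshold=0.8)
print(clusters)
merged = merge_inventories([{3019028: 10, 46006019: 20}, {3019028: 30}])
print(merged)
```

Each pass rescans every pair of sets, so this is far too slow for all of the roughly 6,700 topics as written. It is meant to make the merging rule concrete, not to be run at scale.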


In [None]:
# verses for "helping others": 597 of them, so a much larger spread than some other topics
rd.display_topic_data("helping others")
# print(f"{len(df[df.Topic.str.startswith('helping others')])} verses with {df[df.Topic.str.startswith('helping others')].Votes.sum()} votes")

In [None]:
helptopic = "helping others"
helpdf = rd.df[rd.df.Topic == helptopic]
helpdf

In [None]:
helpvotesmedian = helpdf.Votes.median()
helpdf[helpdf.Votes > helpvotesmedian]

In [None]:
helpabovemedian = helpdf[helpdf.Votes >= helpvotesmedian]
helpbelowmedian = helpdf[helpdf.Votes < helpvotesmedian]
print(f"Votes at or above median: {len(helpabovemedian)} verses, {helpabovemedian.Votes.sum()} votes")
print(f"Votes below median: {len(helpbelowmedian)} verses, {helpbelowmedian.Votes.sum()} votes")

## Multi-word Topics

How many topics have multiple terms, and what's the distribution?

In [None]:
# get NLTK stop words and build a "superstring" from each topic with the stop words removed
import nltk
from nltk.corpus import stopwords
# nltk.download("stopwords")  # run once if the corpus is not installed locally
english_stop_words = set(stopwords.words('english'))
def slugify(string: str) -> str:
    """Remove stop words from string and concatenate the remaining words (no separator)."""
    return "".join([s for s in string.split(" ") if s not in english_stop_words])


In [None]:
topicsdf = pd.DataFrame(rd.df.Topic.value_counts())
topicsdf = topicsdf.reset_index()
topicsdf.columns = ["Topic", "Count"]
#topicsdf
# name the column Termcount, which is what the cells below reference
topicsdf["Termcount"] = topicsdf.apply(lambda t: len(t.Topic.split(" ")), axis=1)
# remove stopwords
topicsdf["SlugTerm"] = topicsdf.apply(lambda t: slugify(t.Topic), axis=1)
topicsdf.head()

In [None]:
# make a big NxN matrix of pairwise similarity for all the SlugTerm values
import numpy as np
from thefuzz import fuzz
slugterms = list(topicsdf.SlugTerm)
n_slugs = len(slugterms)
# fill a float array by position (NaN where not computed):
# much faster than .loc assignment, and unambiguous if two topics share a slug
scores = np.full((n_slugs, n_slugs), np.nan, dtype=np.float32)
for i in range(n_slugs):
    for j in range(i+1, n_slugs):
        # upper triangle only: each pair is scored once
        scores[i, j] = fuzz.ratio(slugterms[i], slugterms[j])/100
# one row and one column per slug
slugdf = pd.DataFrame(scores, columns=slugterms, index=slugterms)
slugdf.head()

In [None]:
len(slugterms)

In [None]:
# quite a few 2-3 word topics, and even one with 12 words! "were does the bible teach that jesus is the so..."
topicsdf.Termcount.value_counts()

In [None]:
topicsdf[topicsdf.Termcount==8].Topic

In [None]:
topicsdf[topicsdf.Termcount==12].Topic

In [None]:
# distributional statistics for Votes
# Unlike the website display, it looks like only verses with at least 10 votes are included in the downloaded data
# The median number of votes is 30
# The standard deviation is very large! A lot of strong outliers at the upper end apparently. 
rd.df.Votes.describe()

In [None]:
# focusing on the fat head
rd.df.Votes.describe(percentiles=[.75, .80, .85, .90, .95])

In [None]:
# the topic+verse with the most votes: "helping others"
rd.df[rd.df.Votes == 24420]

In [None]:
# other topics for the most popular verse. 
rd.df[rd.df.StartVerseId=="50002004"]

In [None]:
# 50002005 is within the range for "affliction" above: does it also occur as a start or end?
# yes: a lot! This suggests we need to enumerate ranges into their components for better verse counting
# but just multiplying e.g. a range of five into five rows would seriously overweight their votes. 
# Dividing their votes by the range seems like it might _underweight_ their votes. 
# It's the old "how to score ranges" problem. 
rd.df[(rd.df.StartVerseId=="50002005") | (rd.df.EndVerseId=="50002005")]

In [None]:
# comparing to verses for "tattoo*": 83 of them
topicsubstr = "tattoo"
tattoodf = rd.df[rd.df.Topic.str.startswith(topicsubstr)]
print(f"{len(tattoodf)} verses with {tattoodf.Votes.sum()} votes")
tattoodf.Topic.value_counts()

In [None]:
from clearlib.util import listalign
listalign.compare(["tattoos", "and", "piercings"], ["tattoo", "and", "body", "piercings"])