# Exploring OpenBible.info Data

This dataset by Stephen Smith is a collection of topics and associated verses, started in 2007. The initial set was about 4,000 topics, each based on completing the phrase "What does the Bible say about...". After that:

> I used the Yahoo Web Search API to get the top thirty webpages related to each topic and then extracted the verse references from each page.

(This approach likely pulled in some irrelevant verses; in theory, user voting should mitigate this.)

> This Bible is a mashup of the Yahoo! and ESV Bible web services. It searches the Internet for the topics that interest people, many of which you’d never find in a traditional topical Bible. Then it shows relevant verses.

Because of the methodology for adding topics, this data might be most useful if combined with other, more curated topic inventories.

On the resulting site, users are invited to:
- vote on the relevance of the verse to the topics
- suggest other verses for a topic
- suggest new topics

> Since launching three weeks ago, people have voted up or down 3,000 verses and suggested 200 new verses, in addition to creating 500 new topics.

The original site had a passage -> tag cloud feature that no longer appears to function.

## Scoring

The initial scores:

> each page got one vote per unique verse—so two references to John 1:1 on the same page would only count as one vote. All verses that appeared on two or more webpages made it into the main TB index.

(This suggests that any topic/passage pair which still has only two votes might be discardable.)

> About 750 of the topics occurred in both the new TB and in Nave’s; every verse for each topic in Nave’s got an extra three votes in the new TB.
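
The initial scoring described in these quotes can be sketched roughly as follows. This is an illustration only, not the site's actual code: the function name `initial_votes`, its parameters, and the sample references are hypothetical, and it assumes that a verse listed in Nave's receives its three votes even if it did not reach the two-page threshold.

```python
from collections import Counter


def initial_votes(pages, naves_verses=(), min_pages=2, naves_bonus=3):
    """Tally one vote per page per unique verse, keep verses seen on at
    least `min_pages` pages, then add a bonus for verses listed in Nave's."""
    counts = Counter()
    for refs in pages:
        counts.update(set(refs))  # duplicates on one page count once
    votes = {verse: n for verse, n in counts.items() if n >= min_pages}
    for verse in naves_verses:
        # assumption: the bonus applies even to verses below the threshold
        votes[verse] = votes.get(verse, 0) + naves_bonus
    return votes


# made-up sample: three web pages for one topic
pages = [
    ["John 1:1", "John 1:1", "Gen 1:1"],  # John 1:1 repeated on one page
    ["John 1:1"],
    ["Rom 3:23"],
]
print(initial_votes(pages, naves_verses={"John 1:1", "Ps 23:1"}))
```

In this toy example "John 1:1" ends up with five votes (two pages plus the Nave's bonus), while "Gen 1:1" and "Rom 3:23" are dropped because each appears on only one page.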

* Initial release: [June « 2007 « OpenBible.info Blog](https://www.openbible.info/blog/2007/06/)
* [Topical Bible Technical Notes « OpenBible.info Blog](https://www.openbible.info/blog/2007/07/topical-bible-technical-notes/)
* Other blog posts in the Topics category: https://www.openbible.info/blog/category/topics/. These include some interesting change-over-time analysis for some hot-button topics.

## Duplicate Topics

> Searching for a word will automatically add it.

This means there's some duplication that should probably be collapsed, e.g.:

```
Tatoos On The Body
...
Tattoo
Tattooing
Tattooing Your Body
Tattoos
Tattoos And Body Piercings
Tattoos And Piercings
Tattoos Body Piercings
```
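
One possible way to collapse labels like these is to group them under a normalized key. The sketch below is a rough illustration, not something the dataset or the `reader` module provides: `stem`, `topic_key`, `group_topics`, and the stopword list are all hypothetical, and the stemming is deliberately crude.

```python
from collections import defaultdict

STOPWORDS = {"and", "on", "the", "your"}  # illustrative, not exhaustive


def stem(word):
    """Very crude stemmer: strip a plural 's', then a trailing 'ing'."""
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    if len(word) > 5 and word.endswith("ing"):
        word = word[:-3]
    return word


def topic_key(label):
    """Order-insensitive key built from the stemmed, non-stopword tokens."""
    words = [w for w in label.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(stem(w) for w in words))


def group_topics(labels):
    """Map each normalized key to the original labels that share it."""
    groups = defaultdict(list)
    for label in labels:
        groups[topic_key(label)].append(label)
    return dict(groups)


labels = [
    "Tatoos On The Body", "Tattoo", "Tattooing", "Tattooing Your Body",
    "Tattoos", "Tattoos And Body Piercings", "Tattoos And Piercings",
    "Tattoos Body Piercings",
]
for key, members in group_topics(labels).items():
    print(f"{key!r}: {members}")
```

With this key, "Tattoo", "Tattooing", and "Tattoos" fall into one group, as do "Tattoos And Body Piercings" and "Tattoos Body Piercings". The misspelled "Tatoos On The Body" stays separate; catching it would need fuzzy matching (for example with the standard library's `difflib`), which this sketch does not attempt.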

## Updates

Note this data is still updated weekly: this snapshot is from 2024-08-05. It might be interesting to compare against previous versions to see whether the data is still growing, and how. For example, the letter T currently has 734 entries: it had the same number of entries in 2014.
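
A comparison between snapshots could start with something as simple as the sketch below. It assumes each snapshot has already been reduced to a list of topic labels; the helper names and the sample labels are made up for illustration.

```python
from collections import Counter


def counts_by_letter(topics):
    """Count distinct topics by their first character (upper-cased)."""
    return Counter(t[0].upper() for t in set(topics) if t)


def compare_snapshots(old_topics, new_topics):
    """Report which topic labels were added or removed between snapshots."""
    old, new = set(old_topics), set(new_topics)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# made-up sample data standing in for two real snapshots
old_snapshot = ["tattoo", "tattoos", "tithing"]
new_snapshot = ["tattoo", "tattoos", "tithing", "texting"]

print(counts_by_letter(new_snapshot)["T"])
print(compare_snapshots(old_snapshot, new_snapshot))
```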

The analysis below shows:
* There are now about 6,700 unique topics (6,713 in this snapshot), so some de-duping and consolidation is probably needed.
* There is a "fat head" of topic/verse pairs with vote counts far above the mean.

In [1]:
from src.openbibleinfo import reader
rd = reader.Reader()

In [2]:
# some passages are ranges (EXO 20:1-26), others a single verse (Gal 5:14)
rd.df.head()

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes
0,10 commandments,2020001,2020026.0,291
1,10 commandments,48005014,,140
2,10 commandments,45013008,45013010.0,114
3,10 commandments,5004013,,101
4,10 commandments,2034028,,93
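
The verse IDs appear to pack book, chapter, and verse into one number: `2020001` lines up with EXO 20:1 and `48005014` with Gal 5:14 in the cell comment above. Assuming that layout (book, then three digits of chapter, then three digits of verse) holds throughout, a small decoder could look like this. The name `decode_verse_id` is hypothetical and not part of the `reader` module.

```python
def decode_verse_id(verse_id):
    """Split an ID like 2020001 into (book, chapter, verse) numbers.

    Assumes the layout is book number, 3-digit chapter, 3-digit verse.
    """
    n = int(float(verse_id))  # tolerate strings and floats such as 2020026.0
    book, rest = divmod(n, 1_000_000)
    chapter, verse = divmod(rest, 1_000)
    return book, chapter, verse


print(decode_verse_id(2020001))     # book 2, chapter 20, verse 1
print(decode_verse_id("48005014"))  # book 48, chapter 5, verse 14
```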


In [3]:
n_topics = len(rd.df.Topic.value_counts())
n_records = len(rd.df)
n_votes = rd.df.Votes.sum()
print(f"Number of uniq topics: \t{n_topics:9}")
print(f"Number of rows: \t{n_records:9}")
print(f"Number of votes: \t{n_votes:9}")

Number of uniq topics: 	     6713
Number of rows: 	   465956
Number of votes: 	 73692178


In [4]:
# distributional statistics for Votes
# Unlike the website display, it looks like only verses with at least 10 votes are included in the downloaded data
# The median number of votes is 30
# The standard deviation (~503) is more than three times the mean (~158): there are strong outliers at the upper end.
rd.df.Votes.describe()

count    465956.000000
mean        158.152654
std         502.690678
min          10.000000
25%          15.000000
50%          30.000000
75%          89.000000
max       24420.000000
Name: Votes, dtype: float64

In [5]:
# focusing on the fat head
rd.df.Votes.describe(percentiles=[.75, .80, .85, .90, .95])

count    465956.000000
mean        158.152654
std         502.690678
min          10.000000
50%          30.000000
75%          89.000000
80%         124.000000
85%         188.000000
90%         329.000000
95%         711.000000
max       24420.000000
Name: Votes, dtype: float64

In [6]:
# the topic+verse with the most votes: "helping others"
rd.df[rd.df.Votes == 24420]

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes
191397,helping others,50002004,,24420


## Expanding the Data: Passage Length

Some rows have an `EndVerseId` value, indicating a range. The next cell adds a `PassageLength` column:
* The value is 1 if there is no `EndVerseId`.
* If the `EndVerseId` is in a different chapter, arbitrarily return a length of 99.
* If the `EndVerseId` is in a different book, return 1 (I should mark such cases so they can be treated as single verses later).
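
The rules in this list can be sketched as a standalone function. This is not the actual `rd.passagelen` implementation, which is not shown here: `passage_length`, `split_id`, and `is_missing` are hypothetical names, and the sketch assumes IDs are laid out as book number, 3-digit chapter, 3-digit verse. It also does not handle pandas `<NA>` values.

```python
import math


def split_id(verse_id):
    """Return (book, chapter, verse) for an ID such as 2020001."""
    n = int(float(verse_id))
    book, rest = divmod(n, 1_000_000)
    chapter, verse = divmod(rest, 1_000)
    return book, chapter, verse


def is_missing(value):
    """True for None or a float NaN (how a missing EndVerseId may appear)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def passage_length(start_id, end_id):
    if is_missing(end_id):
        return 1                      # single verse
    s_book, s_chap, s_verse = split_id(start_id)
    e_book, e_chap, e_verse = split_id(end_id)
    if s_book != e_book:
        return 1                      # cross-book range: treat as one verse
    if s_chap != e_chap:
        return 99                     # cross-chapter range: arbitrary marker
    return e_verse - s_verse + 1      # inclusive count within one chapter


print(passage_length(2020001, 2020026.0))       # same chapter
print(passage_length("62003006", "64001009"))   # different books
```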


In [7]:
rd.df["PassageLength"] = rd.df.apply(lambda row: rd.passagelen(row.StartVerseId, row.EndVerseId), axis=1)

Bad range 62003006, 64001009: returning 1
Bad range 01050021, 02002019: returning 1
Bad range 49001004, 50004019: returning 1
Bad range 52004001, 56002015: returning 1
Bad range 60001018, 61003002: returning 1
Bad range 01050021, 02001022: returning 1
Bad range 60004014, 61002022: returning 1
Bad range 50004019, 52001010: returning 1
Bad range 52005001, 53001012: returning 1
Bad range 62002002, 64001009: returning 1
Bad range 54003003, 55004003: returning 1
Bad range 52004001, 53003018: returning 1
Bad range 49006001, 50002030: returning 1
Bad range 60005007, 61001004: returning 1
Bad range 60001022, 61002002: returning 1
Bad range 60005001, 61001021: returning 1
Bad range 60005002, 61001004: returning 1
Bad range 52002003, 53002011: returning 1
Bad range 54006011, 55004011: returning 1
Bad range 54006011, 55002022: returning 1
Bad range 60005006, 61001004: returning 1
Bad range 52004001, 53002017: returning 1
Bad range 52005003, 53001009: returning 1
Bad range 63001009, 64001009: retu

In [8]:
rd.df["UsableRange"] = rd.df.apply(rd.usablerange, axis=1)

In [9]:
rd.df.head()

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes,PassageLength,UsableRange
0,10 commandments,2020001,2020026.0,291,26,True
1,10 commandments,48005014,,140,1,False
2,10 commandments,45013008,45013010.0,114,3,True
3,10 commandments,5004013,,101,1,False
4,10 commandments,2034028,,93,1,False


## Expanding the Data: Topic Overlap

* TODO: create another dataframe indicating the degree of overlap between pairs of topics. High overlap could indicate duplicate topics ("tattoo*"), synonymous topic labels, or otherwise related topics.
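
One candidate overlap measure is the Jaccard index over each topic's set of verse IDs. The sketch below is only one option, using plain dictionaries rather than a dataframe; the function names and the sample verse sets are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def topic_overlaps(topic_verses, min_score=0.0):
    """Return (topic_a, topic_b, score) for each pair, highest score first."""
    pairs = []
    for (t1, v1), (t2, v2) in combinations(sorted(topic_verses.items()), 2):
        score = jaccard(v1, v2)
        if score > min_score:
            pairs.append((t1, t2, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# made-up verse sets, not taken from the real data
topic_verses = {
    "tattoo": {"03019028", "46006019"},
    "tattoos": {"03019028", "46006019", "46003016"},
    "helping others": {"50002004"},
}
for t1, t2, score in topic_overlaps(topic_verses):
    print(f"{t1} / {t2}: {score:.2f}")
```

Comparing every pair is quadratic in the number of topics; with about 6,700 topics that is roughly 22 million pairs, so a real run would probably need to prune candidates first (for example, only comparing topics that share at least one verse).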

In [10]:
# verses for "helping others": 597 of them, so a much larger spread than some other topics
rd.display_topic_data("helping others")
# print(f"{len(df[df.Topic.str.startswith('helping others')])} verses with {df[df.Topic.str.startswith('helping others')].Votes.sum()} votes")

597 verses with 829782 votes
                 Topic StartVerseId EndVerseId  Votes  PassageLength  \
191397  helping others     50002004       <NA>  24420              1   
191398  helping others     58013016       <NA>  23916              1   
191399  helping others     48006002       <NA>  23261              1   
191400  helping others     20019017       <NA>  22118              1   
191401  helping others     42006038       <NA>  21705              1   
...                ...          ...        ...    ...            ...   
191989  helping others     42012006       <NA>     10              1   
191990  helping others     19072004       <NA>     10              1   
191991  helping others     23058005       <NA>     10              1   
191992  helping others     45013001   45013014     10             14   
191993  helping others     23001004       <NA>     10              1   

        UsableRange  
191397        False  
191398        False  
191399        False  
191400        Fals

In [11]:
helptopic = "helping others"
helpdf = rd.df[rd.df.Topic == helptopic]
helpdf

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes,PassageLength,UsableRange
191397,helping others,50002004,,24420,1,False
191398,helping others,58013016,,23916,1,False
191399,helping others,48006002,,23261,1,False
191400,helping others,20019017,,22118,1,False
191401,helping others,42006038,,21705,1,False
...,...,...,...,...,...,...
191989,helping others,42012006,,10,1,False
191990,helping others,19072004,,10,1,False
191991,helping others,23058005,,10,1,False
191992,helping others,45013001,45013014,10,14,True


In [12]:
helpvotesmedian = helpdf.Votes.median()
helpdf[helpdf.Votes > helpvotesmedian]

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes,PassageLength,UsableRange
191397,helping others,50002004,,24420,1,False
191398,helping others,58013016,,23916,1,False
191399,helping others,48006002,,23261,1,False
191400,helping others,20019017,,22118,1,False
191401,helping others,42006038,,21705,1,False
...,...,...,...,...,...,...
191689,helping others,20022022,20022023,88,2,True
191690,helping others,58013017,,88,1,False
191691,helping others,46010013,,87,1,False
191692,helping others,51003023,51003024,87,2,True


In [13]:
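# note: "above" here includes verses exactly at the median (>=), unlike the strict > in the previous cell
helpabovemedian = helpdf[helpdf.Votes >= helpvotesmedian]
helpbelowmedian = helpdf[helpdf.Votes < helpvotesmedian]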
print(f"Votes above median: {len(helpabovemedian)} verses, {helpabovemedian.Votes.sum()} votes ")
print(f"Votes below median: {len(helpbelowmedian)} verses, {helpbelowmedian.Votes.sum()} votes ")

Votes above median: 303 verses, 820125 votes 
Votes below median: 294 verses, 9657 votes 
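
To make the concentration explicit, the two sums printed above can be turned into a share. This is just arithmetic on the numbers already shown:

```python
above_votes, below_votes = 820_125, 9_657  # sums printed above
total = above_votes + below_votes
share_pct = 100 * above_votes / total
print(f"{total} votes in total; {share_pct:.1f}% from the upper half")
# prints: 829782 votes in total; 98.8% from the upper half
```

So for "helping others", the lower half of the verses contributes only a little over 1% of the votes.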


In [14]:
# other topics for the most popular verse. 
rd.df[rd.df.StartVerseId=="50002004"]

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes,PassageLength,UsableRange
7913,affliction,50002004,50002008,13,5,True
9810,ambition,50002004,,104,1,False
18898,attitude,50002004,,219,1,False
23138,bad leadership,50002004,,144,1,False
26550,bearing each others burdens,50002004,,37,1,False
...,...,...,...,...,...,...
421006,thinking,50002004,,11,1,False
425613,training,50002004,,20,1,False
436980,unselfishness,50002004,,124,1,False
438065,vaccinations,50002004,,256,1,False


In [15]:
# 50002005 is within the range for "affliction" above: does it also occur as a start or end?
# yes: a lot! This suggests we need to enumerate ranges into their components for better verse counting
# but just multiplying e.g. a range of five into five rows would seriously overweight their votes. 
# Dividing their votes by the range seems like it might _underweight_ their votes. 
# It's the old "how to score ranges" problem. 
rd.df[(rd.df.StartVerseId=="50002005") | (rd.df.EndVerseId=="50002005")]

Unnamed: 0,Topic,StartVerseId,EndVerseId,Votes,PassageLength,UsableRange
8715,agreement,50002001,50002005,13,5,True
14674,approachability,50002001,50002005,13,5,True
15409,arianism,50002005,50002011,242,7,True
18756,attention seekers,50002003,50002005,75,3,True
18771,attitude,50002005,,4404,1,False
...,...,...,...,...,...,...
449703,who am i in christ,50002005,,127,1,False
462062,worthy,50002001,50002005,10,5,True
463186,yoke,50002005,,22,1,False
463456,you are a royal priesthood,50002005,50002008,26,4,True
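
The two weighting options mentioned in the comments above can be compared side by side. The sketch below is not a recommendation for either: the function names are hypothetical, it only handles same-chapter ranges (where verse IDs are assumed to be consecutive integers), and it uses two rows copied from the outputs above.

```python
from collections import defaultdict


def expand_range(start_id, end_id=None):
    """List every verse ID in a same-chapter range (inclusive)."""
    start = int(start_id)
    end = start if end_id is None else int(end_id)
    return list(range(start, end + 1))


def verse_votes(rows, split=False):
    """Sum votes per verse; with split=True a range's votes are divided
    evenly among its verses instead of being credited in full to each."""
    totals = defaultdict(float)
    for start_id, end_id, votes in rows:
        verses = expand_range(start_id, end_id)
        weight = votes / len(verses) if split else votes
        for verse in verses:
            totals[verse] += weight
    return dict(totals)


rows = [
    ("50002004", "50002008", 13),  # "affliction" range
    ("50002005", None, 4404),      # "attitude" single verse
]
full = verse_votes(rows)
shared = verse_votes(rows, split=True)
print(full[50002005], shared[50002005])
```

With full weighting, verse 50002005 gets all 13 votes from the five-verse range on top of its own 4,404; with split weighting it gets only 13 / 5 = 2.6 of them.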


In [16]:
# comparing to verses for "tattoo*": 83 of them
topicsubstr = "tattoo"
tattoodf = rd.df[rd.df.Topic.str.startswith(topicsubstr)]
print(f"{len(tattoodf)} verses with {tattoodf.Votes.sum()} votes")
tattoodf.Topic.value_counts()