# Exploring OpenBible.info Data

This dataset by Stephen Smith is a collection of topics and associated verses, started in 2007. The initial set was about 4000 topics, based on completing the phrase "What does the Bible say about...". After that:

> I used the Yahoo Web Search API to get the top thirty webpages related to each topic and then extracted the verse references from each page.

(This approach likely included some irrelevant verses: in theory the voting approach should mitigate this.)

> This Bible is a mashup of the Yahoo! and ESV Bible web services. It searches the Internet for the topics that interest people, many of which you’d never find in a traditional topical Bible. Then it shows relevant verses.

Because of the methodology for adding topics, this data might be most useful if combined with other, more curated topic inventories.

On the resulting site, users are invited to:
- vote on the relevance of the verse to the topics
- suggest other verses for a topic
- suggest new topics

> Since launching three weeks ago, people have voted up or down 3,000 verses and suggested 200 new verses, in addition to creating 500 new topics.

The original site had a passage -> tag cloud feature that no longer appears to function.

## Scoring

The initial scores:

> each page got one vote per unique verse—so two references to John 1:1 on the same page would only count as one vote. All verses that appeared on two or more webpages made it into the main TB index.

(This suggests that any topic/passage pair which still has only two votes might be discardable.)

> About 750 of the topics occurred in both the new TB and in Nave’s; every verse for each topic in Nave’s got an extra three votes in the new TB.
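
The seeding rules quoted above can be sketched roughly as follows. This is an illustrative reconstruction, not Stephen Smith's code: the function name `seed_votes`, the input shapes, and the sample pages are all hypothetical, and applying the Nave's bonus only to verses already in the index is an assumption.

```python
from collections import Counter


def seed_votes(pages, naves_verses=frozenset(), min_pages=2, naves_bonus=3):
    """Hypothetical reconstruction of the initial scoring for one topic.

    pages: one list of verse references per web page.
    naves_verses: verses listed under the same topic in Nave's.
    """
    votes = Counter()
    for refs in pages:
        # one vote per unique verse per page
        votes.update(set(refs))
    # keep only verses that appeared on at least `min_pages` pages
    kept = {verse: n for verse, n in votes.items() if n >= min_pages}
    # ASSUMPTION: the bonus only applies to verses already in the index
    for verse in kept:
        if verse in naves_verses:
            kept[verse] += naves_bonus
    return kept


# invented sample data: three pages for a single topic
pages = [
    ["John 1:1", "John 1:1", "Gen 1:1"],
    ["John 1:1", "Rom 3:23"],
    ["Gen 1:1", "John 1:1"],
]
scores = seed_votes(pages, naves_verses={"Gen 1:1"})
print(scores)
```

Under these rules a pair that still has exactly two votes has never been voted up by a user, which is why it might be a candidate for discarding.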

* Initial release: [June « 2007 « OpenBible.info Blog](https://www.openbible.info/blog/2007/06/)
* [Topical Bible Technical Notes « OpenBible.info Blog](https://www.openbible.info/blog/2007/07/topical-bible-technical-notes/)
* Other blog posts in the Topics category: https://www.openbible.info/blog/category/topics/. These include some interesting change-over-time analysis for some hot-button topics.

## Duplicate Topics

> Searching for a word will automatically add it.

This means there's some duplication that should probably be collapsed, e.g. 

```
Tatoos On The Body
...
Tattoo
Tattooing
Tattooing Your Body
Tattoos
Tattoos And Body Piercings
Tattoos And Piercings
Tattoos Body Piercings
```

## Updates

Note this data is still updated weekly: this snapshot is from 2024-08-05. It might be interesting to compare against previous versions to see whether the data is still growing, and how. For example, the letter T currently has 734 entries: it had the same number of entries in 2014.

The analysis below shows:
* There are now 6700 unique topics, so some de-duping and consolidation is probably needed.
* There is a "fat head" of topics with vote counts far above the mean.

In [1]:
# Setup
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

from src.openbibleinfo import DATAPATH, reader
rd = reader.Reader()

AttributeError: module 'pandas' has no attribute 'Dataframe'

In [None]:
# some passages are ranges (EXO 20:1-26), others a single verse (Gal 5:14)
rd.df.head()

In [None]:
n_topics = len(rd.df.Topic.value_counts())
n_records = len(rd.df)
n_votes = rd.df.Votes.sum()
print(f"Number of uniq topics: \t{n_topics:9}")
print(f"Number of rows: \t{n_records:9}")
print(f"Number of votes: \t{n_votes:9}")

## Expanding the Data: Passage Length

Some verses have an `EndVerseId` value, indicating a range. This adds a `PassageLength` column. 
* The value is 1 if no `EndVerseId`
* If the `EndVerseId` is in a different chapter, arbitrarily return length == 99. My code can't currently enumerate these verses, and they are potentially too long anyway.
* If the `EndVerseId` is in a different book, return 999. These ranges seem way too long to be useful. There are 72 of them, some on hot-button topics, and together they account for less than 1/1000 of all votes.

`UsableRange` is True if a range and length < 99. `UsableReference` is True if a single verse or `UsableRange` is True. 
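
The rules above can be sketched in plain Python. This is not the `Reader` implementation (which is not shown here): the function names are hypothetical, verse ids are assumed to be 8-character `BBCCCVVV` strings (as ids like `50002004` suggest), and a missing `EndVerseId` is assumed to be `None` or an empty string.

```python
CROSS_CHAPTER = 99   # sentinel: range ends in a different chapter
CROSS_BOOK = 999     # sentinel: range ends in a different book


def parse_verse_id(verse_id):
    """Split an id into (book, chapter, verse). ASSUMES the BBCCCVVV layout."""
    return int(verse_id[:2]), int(verse_id[2:5]), int(verse_id[5:8])


def passage_length(start_id, end_id=None):
    """Number of verses in a passage, or a sentinel for cross-chapter/book ranges."""
    if not end_id:
        return 1
    start_book, start_chapter, start_verse = parse_verse_id(start_id)
    end_book, end_chapter, end_verse = parse_verse_id(end_id)
    if start_book != end_book:
        return CROSS_BOOK
    if start_chapter != end_chapter:
        return CROSS_CHAPTER
    return end_verse - start_verse + 1


def usable_range(start_id, end_id=None):
    """True for a range that stays within one chapter."""
    return bool(end_id) and passage_length(start_id, end_id) < CROSS_CHAPTER


def usable_reference(start_id, end_id=None):
    """True for a single verse or a usable range."""
    return not end_id or usable_range(start_id, end_id)


print(passage_length("02020001", "02020026"))  # EXO 20:1-26
```

One caveat with sentinel values: a within-chapter range of 99 or more verses (possible in Psalm 119) would be indistinguishable from a cross-chapter range.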

In [None]:
# cross-chapter ranges
ccrecords = rd.df[rd.df.PassageLength==99]
print(f"{len(ccrecords)} cross-chapter records")
print(f"{(ccrecords.Votes.sum() / n_votes)*100:.2f}% of all votes")
ccrecords.Topic.value_counts()

In [None]:
# cross-book ranges: these will likely get dropped
cbrecords = rd.df[rd.df.PassageLength==999]
print(f"{(cbrecords.Votes.sum() / n_votes)*100:.3f}% of all votes")
cbrecords.Topic.value_counts()

In [None]:
# topicsdf drops cross-chapter and cross-book passages (PassageLength >= 99)
n_topics = len(rd.topicsdf.Topic.value_counts())
n_records = len(rd.topicsdf)
n_votes = rd.topicsdf.Votes.sum()
print(f"Number of uniq topics: \t{n_topics:9}")
print(f"Number of rows: \t{n_records:9}")
print(f"Number of votes: \t{n_votes:9}")

### Distribution of Passage Lengths

This excludes cross-chapter and cross-book ranges. 

Looks like lengths <= 5 cover about 90% of the data. So perhaps enumerate and split votes up to that length, and otherwise treat as atomic (perhaps with a constant epsilon weight)? 
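
One way to read that proposal, as a minimal sketch: short ranges are enumerated and their votes divided evenly, while longer ranges stay as a single unit with a constant weight. The function name `split_votes`, the `start-end` key for atomic ranges, and the default `epsilon` are all assumptions for illustration; ids are again assumed to be `BBCCCVVV` strings.

```python
def split_votes(start_id, end_id, votes, max_len=5, epsilon=1.0):
    """Return {key: weight} for a within-chapter range (hypothetical helper)."""
    start_verse, end_verse = int(start_id[5:8]), int(end_id[5:8])
    # the first five characters (book + chapter) must match
    if start_id[:5] != end_id[:5] or end_verse < start_verse:
        raise ValueError("expected a within-chapter range")
    length = end_verse - start_verse + 1
    if length > max_len:
        # too long to enumerate: keep the range atomic with a constant weight
        return {f"{start_id}-{end_id}": epsilon}
    share = votes / length
    return {
        f"{start_id[:5]}{verse:03d}": share
        for verse in range(start_verse, end_verse + 1)
    }


print(split_votes("50002003", "50002005", 30))
print(split_votes("40005001", "40005048", 100))
```

Dividing evenly is only one option. As noted later in this notebook, it may underweight ranges, while copying the full vote count to every verse would overweight them.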

In [None]:
# compare against the full data: n_records was reassigned to len(rd.topicsdf) above
print(f"Records with a 'good' PassageLength value: {len(rd.topicsdf)} ({(len(rd.topicsdf)/len(rd.df))*100:.2f}%)")
rd.topicsdf.PassageLength.value_counts(normalize=True)

## Expanding the Data: aggregating intersecting references

Topic references sometimes intersect for a topic: for example "marrying a divorced woman" has votes for MAT 5:31, 5:32, and the range 5:31-32 (among others). Combining these would significantly increase the VotesPercentage for this passage.

In other cases there might be two ranges, one subsumed in the other (in this example, MAT 5:1-48 and MAT 5:21-48). Note the first reference (which has more votes) is the entire chapter. 

If we had a reasonable way to combine such references, that should sharpen the counts. Minimally, a single verse and a "short" range ought to be combinable: perhaps "short" here is PassageLength <= 3 or 5? 

The simplest initial step might be to fold each single verse for a topic into any intersecting 2-verse range for that same topic, allocating some portion of its votes to the range: all? 0.75? 0.5?

NOT YET ACCOMPLISHED.
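
As a starting point, that simplest step might look like the sketch below. Nothing here exists in the `Reader` class: `fold_singles_into_pairs` is a hypothetical helper, the vote counts are invented, and leaving the single verses' own votes untouched (rather than moving votes out of them) is just one possible choice.

```python
def fold_singles_into_pairs(records, share=0.5):
    """Add `share` of each single verse's votes to any 2-verse range containing it.

    records: (start_id, end_id_or_None, votes) tuples for ONE topic.
    Returns {(start_id, end_id): votes}; single verses keep their own votes.
    """
    merged = {(start, end): float(votes) for start, end, votes in records}
    singles = {start: votes for start, end, votes in records if end is None}
    for start, end in list(merged):
        if end is None:
            continue
        # a 2-verse range: same book + chapter, consecutive verse numbers
        if start[:5] == end[:5] and int(end[5:8]) - int(start[5:8]) == 1:
            for verse in (start, end):
                merged[(start, end)] += share * singles.get(verse, 0)
    return merged


# invented vote counts for MAT 5:31, 5:32, 5:31-32 and 5:1-48
records = [
    ("40005031", None, 10),
    ("40005032", None, 20),
    ("40005031", "40005032", 40),
    ("40005001", "40005048", 5),
]
merged = fold_singles_into_pairs(records, share=0.5)
print(merged[("40005031", "40005032")])
```

Whole-chapter ranges are deliberately left alone here, since it is less clear how much of their vote a short passage should inherit.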

In [None]:
marryingdivorced = rd.topicsdf[rd.topicsdf.Topic == "marrying a divorced woman"]
marryingdivorced[marryingdivorced.StartVerseId.str.startswith("40005")]

## "Top" Topics

From here onward, we only consider records where `UsableReference` is True. So either a range within a chapter, or a single verse. 

A TopicRecord combines a Topic label, a set of passages, and a count of votes. 
* `TopicPassageCount` is the count of passages for a topic
* `TopicVotesSum` is the sum of votes for a topic
* `MeanPassageVotes` is `TopicVotesSum`/`TopicPassageCount`
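
For reference, these three values can be computed with a pandas `groupby`. The sketch below uses a tiny invented frame with the same column names as the data. It is not the `Reader.top_topics()` implementation, and the vote counts are made up.

```python
import pandas as pd

# invented rows: one row per topic/passage pair
df = pd.DataFrame({
    "Topic": ["alcohol", "alcohol", "alcohol", "immigration"],
    "StartVerseId": ["49005018", "20020001", "43002001", "03019034"],
    "Votes": [300, 200, 100, 80],
})

topicrecords = (
    df.groupby("Topic")
      .agg(TopicPassageCount=("StartVerseId", "count"),
           TopicVotesSum=("Votes", "sum"))
      .reset_index()
)
topicrecords["MeanPassageVotes"] = (
    topicrecords.TopicVotesSum / topicrecords.TopicPassageCount
)
print(topicrecords)
```

A percentile-based cutoff for "top" topics could then come from something like `topicrecords.MeanPassageVotes.quantile(0.8)`.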


In [None]:
# 55 is the 80th percentile of MeanPassageVotes across rd.topicsdf
# so an approximation for "high number of votes per passage"
#toptopics = rd.top_topics(threshold=55)
#toptopics.sort_values('MeanPassageVotes', ascending=False).head(10)
rd.toptopics.head(10)

In [None]:
# the median TopicPassageCount value is 161: that still seems like way too many
rd.toptopics.describe()

In [None]:
# but not all "toptopics" seem "top"
rd.toptopics[rd.toptopics.TopicVotesSum < 1000].head()

In [None]:
# top 50% by TopicPassageCount, looking at the bottom
rd.toptopics[rd.toptopics.TopicPassageCount >= 156].tail()

In [None]:
# we might further filter rd.topicsdf to the top n passages for each topic in toptopics.Topic
# we could also choose a higher threshold for top_topics()
#rd.topicsdf[rd.topicsdf.Topic == "love one another"]
rd.topicsdf[rd.topicsdf.Topic == "immigration"]

### Top Passages for Top Topics

For rd.toptopics collect the verses whose 
`n_votes` is the total number of votes. 

### Quality Score

Stephen measured quality in topic-scores.txt as "percentage of votes for the passage". This is done here with `passage_probability()`, only computed for toptopics. 

In [None]:
# compute across all topics: this takes a few minutes
#rd.topicsdf["VotesPercentage"] = rd.topicsdf.apply(rd.passage_probability, axis=1)
# this approach seems equally slow and produces the same warning
#rd.topicsdf["VotesPercentage"] = rd.topicsdf.apply(rd.passage_probability2, axis=1)
# not sure how to address the warning this produces
# this approach seems faster: still produces the SettingWithCopyWarning warning though
rd.topicsdf["VotesPercentage"] = [prob for tup in rd.topicsdf.itertuples()
                                  if (topic := tup.Topic)
                                  if (votesum := rd.topicvotesum.loc[topic].Votes)
                                  if (prob := int(tup.Votes)/votesum)]
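
A vectorized alternative may be worth trying here. The sketch below runs on an invented frame rather than `rd.topicsdf`, and it assumes the warning arises because the frame is a filtered slice of another frame (the `Reader` source is not shown, so that is a guess). Note also that the `if` clauses in the comprehension above act as filters: any row whose value is falsy would be silently dropped, and the list would then no longer match the frame's length.

```python
import pandas as pd

# invented rows standing in for rd.df
df = pd.DataFrame({
    "Topic": ["alcohol", "alcohol", "alcohol", "immigration"],
    "Votes": [300, 100, 100, 80],
})

# .copy() makes the filtered frame independent of df, so assigning a new
# column does not trigger SettingWithCopyWarning
topics = df[df.Votes > 0].copy()

# each row's share of its topic's total votes, computed without a Python loop
topics["VotesPercentage"] = (
    topics.Votes / topics.groupby("Topic").Votes.transform("sum")
)
print(topics)
```

`transform("sum")` returns one value per row, aligned to the original index, so no separate per-topic lookup table is needed.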

In [None]:
rd.topicsdf.head()

In [None]:
# what's the distribution of VotesPercentage values?
rd.topicsdf.VotesPercentage.describe()

In [None]:
# compare the top passages by VotesPercentage for "alcohol": 8 vs 131
alcoholdf = rd.topicsdf[rd.topicsdf.Topic == "alcohol"]
print(f"All rows: {len(alcoholdf)}")
alcoholdf[alcoholdf.VotesPercentage > 0.02]

In [None]:
# write out top topics
# toptopics is an attribute of the reader (the local assignment above is commented out)
rd.toptopics.to_csv(DATAPATH / "OpenBible.info/toptopics.tsv", sep="\t")

In [None]:
# for each top topic, write out data for all verses
with (DATAPATH / "OpenBible.info/toptopicdata.tsv").open("w") as f:
    f.write("\t".join(["Index", "Topic", "StartVerseId", "EndVerseId", "Votes", "PassageLength", 
                       "UsableRange", "VotesPercentage"]) + "\n")
    for top in rd.toptopics.Topic:
        outdf = rd.topicsdf[rd.topicsdf.Topic == top].drop("UsableReference", axis="columns")
        outdf.to_csv(f, sep="\t", header=False)

In [None]:
# for each top topic, write out only rows whose VotesPercentage > 0.02
with (DATAPATH / "OpenBible.info/toptopictopversedata.tsv").open("w") as f:
    f.write("\t".join(["Index", "Topic", "StartVerseId", "EndVerseId", "Votes", "PassageLength", 
                       "UsableRange", "VotesPercentage"]) + "\n")
    for top in rd.toptopics.Topic:
        outdf = rd.topicsdf[rd.topicsdf.Topic == top].drop("UsableReference", axis="columns")
        # this figure is heuristic, based on the distribution of VotesPercentage values
        topversedf = outdf[outdf.VotesPercentage > 0.02]
        topversedf.to_csv(f, sep="\t", header=False)

In [None]:
from src import DATAPATH

In [None]:
stop

## Expanding the Data: Topic Overlap

Topics overlap due to:
* typos (tatoo vs tattoo)
* general/specific 
* synonymy

Approach for similar strings:
* set a threshold S for similarity
* Compare all pairs T1, T2
* For all pairs whose similarity > S, combine them
* Repeat, averaging similarity across members of each set TS1, TS2. Might need to relax S given the averaging strategy?

Once term sets are combined, combine their verse inventories, averaging their votes. Potentially discard verses with counts below a threshold? 
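
A minimal sketch of the first pass of this approach follows. It covers only the pairwise threshold step, merging transitively with a union-find (in effect single-linkage clustering), and does not implement the repeated averaging across sets. `difflib.SequenceMatcher` from the standard library stands in for `thefuzz` as the similarity measure; the function name and the threshold of 0.8 are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def cluster_topics(topics, threshold=0.8):
    """Group topic strings whose pairwise similarity exceeds `threshold`."""
    parent = {topic: topic for topic in topics}

    def find(topic):
        # follow parent links to the cluster representative, compressing the path
        while parent[topic] != topic:
            parent[topic] = parent[parent[topic]]
            topic = parent[topic]
        return topic

    for t1, t2 in combinations(topics, 2):
        if SequenceMatcher(None, t1, t2).ratio() > threshold:
            parent[find(t1)] = find(t2)  # merge the two clusters

    clusters = {}
    for topic in topics:
        clusters.setdefault(find(topic), []).append(topic)
    return sorted(sorted(members) for members in clusters.values())


clusters = cluster_topics(["tattoo", "tattoos", "tatoos", "helping others"])
print(clusters)
```

Comparing all pairs is quadratic, so for several thousand topics it may be worth blocking first (for example, only comparing topics that share a first letter or a token).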


In [None]:
# verses for "helping others": 597 of them, so a much larger spread than some other topics
rd.display_topic_data("helping others")
# print(f"{len(df[df.Topic.str.startswith('helping others')])} verses with {df[df.Topic.str.startswith('helping others')].Votes.sum()} votes")

In [None]:
helptopic = "helping others"
helpdf = rd.df[rd.df.Topic == helptopic]
helpdf

In [None]:
helpvotesmedian = helpdf.Votes.median()
helpdf[helpdf.Votes > helpvotesmedian]

In [None]:
helpabovemedian = helpdf[helpdf.Votes >= helpvotesmedian]
helpbelowmedian = helpdf[helpdf.Votes < helpvotesmedian]
print(f"Votes above median: {len(helpabovemedian)} verses, {helpabovemedian.Votes.sum()} votes ")
print(f"Votes below median: {len(helpbelowmedian)} verses, {helpbelowmedian.Votes.sum()} votes ")

## Multi-word Topics

How many topics have multiple terms, and what's the distribution?

In [None]:
# get NLTK stopwords and make a superstring removing stop words
import nltk
from nltk.corpus import stopwords
english_stop_words = stopwords.words('english')
def slugify(string: str) -> str:
    """Remove stop words from string and join the results."""
    return "".join([s for s in string.split(" ") if s not in english_stop_words])


In [None]:
topicsdf = pd.DataFrame(rd.df.Topic.value_counts())
topicsdf = topicsdf.reset_index()
topicsdf.columns = ["Topic", "Count"]
#topicsdf
topicsdf["AllTermsCount"] = topicsdf.apply(lambda t: len(t.Topic.split(" ")), axis=1)
# remove stopwords
topicsdf["SlugTerm"] = topicsdf.apply(lambda t: slugify(t.Topic), axis=1)
topicsdf.head()

In [None]:
# make a big NxN matrix of all the SlugTerm values
from thefuzz import fuzz
slugterms = list(topicsdf.SlugTerm)
# Create a square DataFrame of NaN with the slug terms as both rows and columns;
# only the upper triangle is filled in below
slugdf = pd.DataFrame(float("nan"), columns=slugterms, index=slugterms)

for i in range(len(slugterms)):
    st1 = slugterms[i]
    for j in range(i+1, len(slugterms)):
        st2 = slugterms[j]
        slugdf.loc[st1,st2] = fuzz.ratio(st1, st2)/100
slugdf.head()

In [None]:
len(slugterms)

In [None]:
# quite a few 2-3 word topics, and even one with 12 words! "were does the bible teach that jesus is the so..."
# (the term-count column is named AllTermsCount in the cell that builds topicsdf)
topicsdf.AllTermsCount.value_counts()

In [None]:
topicsdf[topicsdf.AllTermsCount==8].Topic

In [None]:
topicsdf[topicsdf.AllTermsCount==12].Topic

In [None]:
# distributional statistics for Votes
# Unlike the website display, it looks like only verses with at least 10 votes are included in the downloaded data
# The median number of votes is 30
# The standard deviation is very large! A lot of strong outliers at the upper end apparently. 
rd.df.Votes.describe()

In [None]:
# focusing on the fat head
rd.df.Votes.describe(percentiles=[.75, .80, .85, .90, .95])

In [None]:
# the topic+verse with the most votes: "helping others" (24420 votes in this snapshot)
rd.df[rd.df.Votes == rd.df.Votes.max()]

In [None]:
# other topics for the most popular verse. 
rd.df[rd.df.StartVerseId=="50002004"]

In [None]:
# 50002005 is within the range for "affliction" above: does it also occur as a start or end?
# yes: a lot! This suggests we need to enumerate ranges into their components for better verse counting
# but just multiplying e.g. a range of five into five rows would seriously overweight their votes. 
# Dividing their votes by the range seems like it might _underweight_ their votes. 
# It's the old "how to score ranges" problem. 
rd.df[(rd.df.StartVerseId=="50002005") | (rd.df.EndVerseId=="50002005")]

In [None]:
# comparing to verses for "tattoo*": 83 of them
topicsubstr = "tattoo"
tattoodf = rd.df[rd.df.Topic.str.startswith(topicsubstr)]
print(f"{len(tattoodf)} verses with {tattoodf.Votes.sum()} votes")
tattoodf.Topic.value_counts()

In [None]:
from clearlib.util import listalign
listalign.compare(["tattoos", "and", "piercings"], ["tattoo", "and", "body", "piercings"])