# Scraping Text Data From Bible Hub

### 1. Import necessary Python libraries

In [27]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

<br>
<br>

### 2. Create Dataframe of Books and Chapters for automating GET requests.

#### Finding the data

To scrape text data from Bible Hub for each book/chapter of the Bible,<br> 
I need the **names of the books** and the **number of chapters** in each,
as the URL is formatted like this:

...biblehub.com/niv/<span style="color:#ffeedd;">{BookName}</span>
/<span style="color:#ffeedd;">{Chapter}</span>.htm

I found a CSV file with the information I need here:
https://github.com/jpoehls/bible-metadata/blob/master/Books.csv 

So now I read 'Books.csv' into a pandas dataframe:

In [31]:
df = pd.read_csv('Books.csv')
df 

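|  | BookID | OsisID | BookName | TotalChapters | Volume |
|---|---|---|---|---|---|
| 0 | 1 | Gen | Genesis | 50 | OT |
| 1 | 2 | Exod | Exodus | 40 | OT |
| 2 | 3 | Lev | Leviticus | 27 | OT |
| 3 | 4 | Num | Numbers | 36 | OT |
| 4 | 5 | Deut | Deuteronomy | 34 | OT |
| ... | ... | ... | ... | ... | ... |
| 61 | 62 | 1John | 1 John | 5 | NT |
| 62 | 63 | 2John | 2 John | 1 | NT |
| 63 | 64 | 3John | 3 John | 1 | NT |
| 64 | 65 | Jude | Jude | 1 | NT |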


#### Fixing data for place in URL

I need to alter the data in the BookName column, as the URLs should not contain spaces. <br>
Thus, "1 John" must become "1_John", which I can do with Python.

The one exception to this is "Song of Solomon", which should be just 'songs', <br>
and which I will change manually after converting the table back to CSV.


In [32]:
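# Replace spaces so the names can be used in URLs ("1 John" -> "1_John")
df['BookName'] = df['BookName'].str.replace(' ', '_')
df.to_csv('Modified_Books.csv', index=False)

# Reload only the two columns needed to build the URLs
columns_to_read = ['BookName', 'TotalChapters']
mdf = pd.read_csv('Modified_Books.csv', usecols=columns_to_read)
mdf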

|  | BookName | TotalChapters |
|---|---|---|
| 0 | Genesis | 50 |
| 1 | Exodus | 40 |
| 2 | Leviticus | 27 |
| 3 | Numbers | 36 |
| 4 | Deuteronomy | 34 |
| ... | ... | ... |
| 61 | 1_John | 5 |
| 62 | 2_John | 1 |
| 63 | 3_John | 1 |
| 64 | Jude | 1 |
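
The manual edit for "Song of Solomon" could also be done in code. Below is a minimal sketch, assuming a small override table; `URL_OVERRIDES` and `to_url_slug` are hypothetical names introduced here for illustration, and the table lists only the one exception mentioned above.

```python
# Hypothetical override table: only the exception noted in this notebook
URL_OVERRIDES = {'Song of Solomon': 'songs'}

def to_url_slug(book_name):
    """Return the URL path segment for a book name."""
    # Use the override when one exists, otherwise swap spaces for underscores
    if book_name in URL_OVERRIDES:
        return URL_OVERRIDES[book_name]
    return book_name.replace(' ', '_')

print(to_url_slug('1 John'))
print(to_url_slug('Song of Solomon'))
```

Applied to the original (unmodified) names, this would look like `df['BookName'] = df['BookName'].map(to_url_slug)`, which keeps the exception in the notebook instead of in a hand-edited CSV file.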


<br>
<br>

### 3. Test the GET request to fetch data from the website.

In [33]:
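url = 'https://biblehub.com/niv/genesis/1.htm'

# Fetch the page and stop early on an HTTP error (4xx/5xx)
page = requests.get(url)
page.raise_for_status()

# Parse the HTML and print it to inspect the page structure
soup = BeautifulSoup(page.text, 'html')
print(soup)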

<html><body><p><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
</p><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1" name="viewport"/><title>Genesis 1 NIV</title><link href="/chapnew2.css" media="Screen" rel="stylesheet" type="text/css"/><link href="../spec.css" media="Screen" rel="stylesheet" type="text/css"/><link href="/print.css" media="Print" rel="stylesheet" type="text/css"/><div id="fx"><table border="0" cellpadding="0" cellspacing="0" id="fx2" width="100%"><tr><td><iframe align="left" frameborder="0" height="30" scrolling="no" src="../cmenus/genesis/1.htm" width="100%"></iframe></td></tr></table></div><div id="blnk"></div><div align="center"><table border="0" cellpadding="0" cellspacing="0" class="maintable" width="100%"><tr><td><div id="fx5"><table border="0" cellpadding="0" cellspacing="0" id="fx6" width="100%"><tr><td><iframe fram

<br>
<br>

### 4. Extract desired text

From inspecting the HTML I see the data I want is in a div with class "padleft". So I can extract this div and remove unwanted elements like headings and footnotes.

In [39]:
chap_div = soup.find('div', class_='padleft')
chap_soup = BeautifulSoup(str(chap_div), 'html.parser')

# Find unwanted elements within chap_soup (find() returns only the first match of each)
v_heading = chap_soup.find('div', class_='vheading')
find_span = chap_soup.find('span', class_='mainfootnotes')
section_head = chap_soup.find('p', class_='sectionhead')
bot_box = chap_soup.find('div', id='botbox')

# Extract unwanted elements
if v_heading:
    v_heading.extract()

if find_span:
    find_span.extract()

if section_head:
    section_head.extract()

if bot_box:
    bot_box.extract()
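
Because `find()` returns only the first matching element, a chapter with several section headings may keep some of them in the text. The sketch below shows one way to remove every match in a single pass with `select()` and `decompose()`. The HTML string is hypothetical sample markup that only reuses the class names from the cell above; it is not copied from Bible Hub.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, reusing the class names from the cell above
html = '''
<div class="padleft">
  <p class="sectionhead">First heading</p>
  1First verse.
  <p class="sectionhead">Second heading</p>
  2Second verse.
  <span class="mainfootnotes">Footnotes here</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
chap_div = soup.find('div', class_='padleft')

# One CSS selector list covers all four kinds of unwanted element;
# select() returns every match, not just the first one
unwanted = 'div.vheading, span.mainfootnotes, p.sectionhead, div#botbox'
for tag in chap_div.select(unwanted):
    tag.decompose()  # remove the tag and its contents from the tree

text = chap_div.get_text()
print(text)
```

Working on `chap_div` directly also avoids re-parsing it into a second soup object.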


<br>
<br>

#### Now I have just the text I want.

In [40]:
chapter = chap_soup.text
print(chapter)



1In the beginning God created the heavens and the earth. 
2Now the earth was formless and empty, darkness was over the surface of the deep, and the Spirit of God was hovering over the waters.
3And God said, “Let there be light,” and there was light. 
4God saw that the light was good, and he separated the light from the darkness. 
5God called the light “day,” and the darkness he called “night.” And there was evening, and there was morning—the first day.
6And God said, “Let there be a vault between the waters to separate water from water.” 
7So God made the vault and separated the water under the vault from the water above it. And it was so. 
8God called the vault “sky.” And there was evening, and there was morning—the second day.
9And God said, “Let the water under the sky be gathered to one place, and let dry ground appear.” And it was so. 
10God called the dry ground “land,” and the gathered waters he called “seas.” And God saw that it was good.
11Then God said, “Let the land produce 

### 5. Use a regular expression to find numbers in the text. <br>
### Create a dataframe with each number as 'VerseID' and the text that follows it as 'VerseText'

In [42]:
verse_pattern = re.compile(r'(\d+)([^0-9]+)')
verses = verse_pattern.findall(chapter)

# Create a DataFrame from the verses
cdf = pd.DataFrame(verses, columns=['VerseID', 'VerseText'])

# Convert VerseID to integer
cdf['VerseID'] = cdf['VerseID'].astype(int)
cdf

|  | VerseID | VerseText |
|---|---|---|
| 0 | 1 | In the beginning God created the heavens and t... |
| 1 | 2 | Now the earth was formless and empty, darkness... |
| 2 | 3 | And God said, “Let there be light,” and there ... |
| 3 | 4 | God saw that the light was good, and he separa... |
| 4 | 5 | God called the light “day,” and the darkness h... |
| 5 | 6 | And God said, “Let there be a vault between th... |
| 6 | 7 | So God made the vault and separated the water ... |
| 7 | 8 | God called the vault “sky.” And there was even... |
| 8 | 9 | And God said, “Let the water under the sky be ... |
| 9 | 10 | God called the dry ground “land,” and the gath... |
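
One caveat with the pattern `(\d+)([^0-9]+)` is that it treats every run of digits as a verse number, so a number inside a verse (an age or a count written in digits) would start a spurious row. The sketch below is a heuristic alternative, not the method used in this notebook: it accepts a number as a verse marker only when it is the next expected verse number. `split_verses` is a hypothetical helper, and the sample string is made up in the style of the scraped text.

```python
import re

def split_verses(chapter_text):
    """Split chapter text into (verse_number, verse_text) pairs.

    Heuristic: a number counts as a verse marker only if it is the next
    expected verse number; any other number stays part of the verse text.
    """
    verses = []
    expected = 1
    # With a capturing group, re.split keeps the numbers in the result:
    # [text_before, number, text, number, text, ...]
    parts = re.split(r'(\d+)', chapter_text)
    for i in range(1, len(parts), 2):
        number, text = parts[i], parts[i + 1]
        if int(number) == expected:
            verses.append([expected, text])
            expected += 1
        elif verses:
            # Not a verse marker: glue the number back onto the previous verse
            verses[-1][1] += number + text
    return [(n, t.strip()) for n, t in verses]

# Made-up sample text with numbers inside the verses
sample = ('1When Adam had lived 130 years, he had a son. '
          '2After Seth was born, Adam lived 800 years.')

naive = re.findall(r'(\d+)([^0-9]+)', sample)
print(len(naive))            # the simple pattern yields 4 rows for 2 verses
print(split_verses(sample))
```

The heuristic still fails if a verse happens to contain the next verse number as an ordinary number, so a parser keyed on the HTML elements that mark verse numbers would be more reliable where the markup allows it.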


### Set up a for loop to iterate through all chapter URLs 

In [43]:
base_url = 'https://biblehub.com/niv/'

# Iterate through the DataFrame
for index, row in mdf.iterrows():
    book_name = row['BookName']
    total_chapters = row['TotalChapters']
    
    # Generate URLs for each chapter
    for ch_num in range(1, total_chapters + 1):
        durl = f'{base_url}{book_name}/{ch_num}.htm'
        print(durl)

https://biblehub.com/niv/Genesis/1.htm
https://biblehub.com/niv/Genesis/2.htm
https://biblehub.com/niv/Genesis/3.htm
https://biblehub.com/niv/Genesis/4.htm
https://biblehub.com/niv/Genesis/5.htm
https://biblehub.com/niv/Genesis/6.htm
https://biblehub.com/niv/Genesis/7.htm
https://biblehub.com/niv/Genesis/8.htm
https://biblehub.com/niv/Genesis/9.htm
https://biblehub.com/niv/Genesis/10.htm
https://biblehub.com/niv/Genesis/11.htm
https://biblehub.com/niv/Genesis/12.htm
https://biblehub.com/niv/Genesis/13.htm
https://biblehub.com/niv/Genesis/14.htm
https://biblehub.com/niv/Genesis/15.htm
https://biblehub.com/niv/Genesis/16.htm
https://biblehub.com/niv/Genesis/17.htm
https://biblehub.com/niv/Genesis/18.htm
https://biblehub.com/niv/Genesis/19.htm
https://biblehub.com/niv/Genesis/20.htm
https://biblehub.com/niv/Genesis/21.htm
https://biblehub.com/niv/Genesis/22.htm
https://biblehub.com/niv/Genesis/23.htm
https://biblehub.com/niv/Genesis/24.htm
https://biblehub.com/niv/Genesis/25.htm
https://b
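
The URL generation can also be kept separate from the fetching code, which makes it easy to check the URLs before sending any requests. This is a minimal sketch: `chapter_urls` is a hypothetical helper, and it lowercases the book name because the test URL used earlier (`.../niv/genesis/1.htm`) has a lowercase book name.

```python
BASE_URL = 'https://biblehub.com/niv/'

def chapter_urls(books):
    """Yield (book, chapter, url) for every chapter of every book.

    `books` is an iterable of (book_name, total_chapters) pairs.
    """
    for book_name, total_chapters in books:
        slug = book_name.lower()
        for ch_num in range(1, int(total_chapters) + 1):
            yield slug, ch_num, f'{BASE_URL}{slug}/{ch_num}.htm'

# Small hand-written input instead of the full dataframe
urls = list(chapter_urls([('Genesis', 2), ('1_John', 1)]))
for item in urls:
    print(item)
```

With the dataframe from step 2, the input could be `zip(mdf['BookName'], mdf['TotalChapters'])`.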

### Combine the scraping code into the for loop 
### and output a dataframe of verses. The book names had to be made lowercase for the URLs to work.

In [52]:
base_url = 'https://biblehub.com/niv/'
all_verses = []

# Iterate through the DataFrame
for index, row in mdf.iterrows():
    book_name = row['BookName'].lower()
    total_chapters = row['TotalChapters']
    
    # Generate URLs for each chapter
    for ch_num in range(1, total_chapters + 1):
        url = f'{base_url}{book_name}/{ch_num}.htm'
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html')
        chap_div = soup.find('div', class_='padleft')
        chap_soup = BeautifulSoup(str(chap_div), 'html.parser')
        v_heading = chap_soup.find('div', class_='vheading')
        find_span = chap_soup.find('span', class_='mainfootnotes')
        section_head = chap_soup.find('p', class_='sectionhead')
        bot_box = chap_soup.find('div', id='botbox')
        if v_heading:
            v_heading.extract()

        if find_span:
            find_span.extract()

        if section_head:
            section_head.extract()

        if bot_box:
            bot_box.extract()

        chapter = chap_soup.text
        verse_pattern = re.compile(r'(\d+)([^0-9]+)')
        verses = verse_pattern.findall(chapter)
        
        all_verses.extend([(book_name, ch_num, verse[0], verse[1]) for verse in verses])
        print(f'Getting text from {book_name} Chapter {ch_num}.')

# Create a DataFrame from all verses
cdf = pd.DataFrame(all_verses, columns=['BookName', 'ChapterNum', 'VerseID', 'VerseText'])
cdf['VerseID'] = cdf['VerseID'].astype(int)


Getting text from genesis Chapter 1.
Getting text from genesis Chapter 2.
Getting text from genesis Chapter 3.
Getting text from genesis Chapter 4.
Getting text from genesis Chapter 5.
Getting text from genesis Chapter 6.
Getting text from genesis Chapter 7.
Getting text from genesis Chapter 8.
Getting text from genesis Chapter 9.
Getting text from genesis Chapter 10.
Getting text from genesis Chapter 11.
Getting text from genesis Chapter 12.
Getting text from genesis Chapter 13.
Getting text from genesis Chapter 14.
Getting text from genesis Chapter 15.
Getting text from genesis Chapter 16.
Getting text from genesis Chapter 17.
Getting text from genesis Chapter 18.
Getting text from genesis Chapter 19.
Getting text from genesis Chapter 20.
Getting text from genesis Chapter 21.
Getting text from genesis Chapter 22.
Getting text from genesis Chapter 23.
Getting text from genesis Chapter 24.
Getting text from genesis Chapter 25.
Getting text from genesis Chapter 26.
Getting text from gen
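
A loop like this sends one request per chapter in quick succession, and a single failed request would stop it partway through. The sketch below shows one possible way to add a timeout, a status check, and a pause between retries. `fetch_with_retry` is a hypothetical helper, and the retry count, delay, and timeout are assumed values, not settings recommended by Bible Hub. The `get` argument is any callable shaped like `requests.get`, passed in so the helper can be tried without network access.

```python
import time

def fetch_with_retry(get, url, retries=3, delay=1.0, timeout=10):
    """Call get(url, timeout=...) up to `retries` times, pausing between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return response
        except Exception as error:  # deliberately broad in this sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    # Every attempt failed: re-raise the last error seen
    raise last_error
```

Inside the loop above, `page = requests.get(url)` could then become `page = fetch_with_retry(requests.get, url)`, optionally followed by a short `time.sleep()` after each chapter to space out the requests.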

In [53]:
cdf

|  | BookName | ChapterNum | VerseID | VerseText |
|---|---|---|---|---|
| 0 | genesis | 1 | 1 | In the beginning God created the heavens and t... |
| 1 | genesis | 1 | 2 | Now the earth was formless and empty, darkness... |
| 2 | genesis | 1 | 3 | And God said, “Let there be light,” and there ... |
| 3 | genesis | 1 | 4 | God saw that the light was good, and he separa... |
| 4 | genesis | 1 | 5 | God called the light “day,” and the darkness h... |
| ... | ... | ... | ... | ... |
| 32055 | revelation | 22 | 17 | The Spirit and the bride say, “Come!” And let ... |
| 32056 | revelation | 22 | 18 | I warn everyone who hears the words of the pro... |
| 32057 | revelation | 22 | 19 | And if anyone takes words away from this scrol... |
| 32058 | revelation | 22 | 20 | He who testifies to these things says, “Yes, I... |


### Convert dataframe to CSV file.

In [54]:
csv_file_path = 'Bible.csv'

# Use the to_csv() method to export the DataFrame to a CSV file
cdf.to_csv(csv_file_path, index=False) 