### Downloading the Dataset

In [3]:
import pymupdf4llm
import requests
from tqdm import tqdm
import re

In [6]:
urls = [
    "https://bibliamundi.com/wp-content/uploads/2023/09/Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible.pdf",
    "https://bibliamundi.com/wp-content/uploads/2023/09/Cebuano-All-Bible.pdf",
    "https://bibliamundi.com/wp-content/uploads/2023/09/Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf"
]

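for url in urls:
    # Use the last path segment of the URL as the local filename.
    filename = url.split("/")[-1]
    print(f"Downloading {filename}...")
    # A timeout keeps a stalled connection from hanging the notebook.
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Saved to {filename}")
    else:
        print(f"Failed to download {url} (status {response.status_code})")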

Downloading Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible.pdf...
Saved to Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible.pdf
Downloading Cebuano-All-Bible.pdf...
Saved to Cebuano-All-Bible.pdf
Downloading Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf...
Saved to Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf


### Converting PDFs to Text Files

In [5]:
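# File stems of the three PDFs downloaded above (without the .pdf extension).
bible_files = ["Cebuano-All-Bible",
               "Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible",
               "Filipino-Magandang-Balita-Biblia-Revised-All-Bible"]

for file in tqdm(bible_files):
    md_text = pymupdf4llm.to_markdown(f"{file}.pdf", show_progress=True)
    # Write as UTF-8 so non-ASCII characters survive on every platform.
    with open(f'{file}.md', 'w', encoding='utf-8') as f:
        f.write(md_text)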

  0%|                                                                                                                         | 0/3 [00:00<?, ?it/s]

Processing Cebuano-All-Bible.pdf...


 33%|█████████████████████████████████████▎                                                                          | 1/3 [02:46<05:32, 166.14s/it]

Processing Hiligaynon-Ang-Pulong-Sang-Dios-All-Bible.pdf...


 67%|██████████████████████████████████████████████████████████████████████████▋                                     | 2/3 [04:53<02:23, 143.59s/it]

Processing Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [07:43<00:00, 154.39s/it]


# First Look
The Bible datasets used in this notebook were primarily sourced from the **Philippine Bible Society**, the leading institution responsible for translating and publishing Scripture in various Philippine languages. All selected versions were obtained from [Bibliamundi](https://bibliamundi.com), a digital platform dedicated to aggregating and freely distributing multilingual editions of the Bible in accessible formats such as PDF.

### Hiligaynon: Maayong Balita nga Biblia (en: The Good News Bible)
The Maayong Balita nga Biblia (2012) is a dynamic-equivalence translation of the Bible into Hiligaynon, designed to communicate the message of Scripture in a clear and natural style. It was translated directly from the original Hebrew and Greek texts by the Philippine Bible Society, using widely accepted scholarly sources such as the Masoretic Text and the UBS Greek New Testament.

This edition adopts an informal yet faithful tone to make the text accessible to contemporary readers. It primarily serves the approximately 7 million Hiligaynon speakers living in Western Visayas and surrounding regions in the Philippines.

### Cebuano: Ang Bag-ong Maayong Balita (en: The Good News Bible)
The Ang Bag-ong Maayong Balita Biblia is a Cebuano translation of the Bible, published by the Philippine Bible Society in 1999. It also adopts a dynamic-equivalence philosophy, which focuses on conveying the meaning of the original texts rather than translating word for word. It likewise uses the Masoretic Text for the Old Testament and the UBS Greek New Testament. One notable feature of this edition, and of the other Good News Bible editions used in this notebook, is the absence of the deuterocanonical books.

Cebuano is spoken by about 20 million people in Central and Eastern Visayas, as well as parts of Mindanao and migrant communities elsewhere. This version is more commonly used in Protestant and evangelical communities.

### Tagalog: Magandang Balita Biblia (en: The Good News Bible)
The Magandang Balita Biblia is a Tagalog Bible translation first published by the Philippine Bible Society in 1973. This notebook uses the revised 2005 edition. 2012 editions of both the Cebuano and Tagalog translations exist, but they are not available for download. Like the Cebuano and Hiligaynon editions, this translation uses the Masoretic Text, and it likewise omits the deuterocanonical books. All three translations with the deuterocanon can, however, be found from other sources in print ([dead tree](https://www.lazada.com.ph/products/pdp-i3864887974-s29222713708.html)).

Tagalog is spoken by over 29 million native speakers in Luzon, with L2 speakers in other regions of the country as well. This edition of the Bible is used by both Catholics and Protestants for personal study and academic research.

# Hiligaynon Section

## Pre-Processing of the Hiligaynon Text

When we convert the Hiligaynon Bible PDF into plain text, certain recurring artifacts become apparent. For example:

```plaintext
Maayong Balita nga Biblia
© 2012 Philippine Bible Society.

19
```

Here, the lines represent the title, the copyright notice, and a standalone page number carried over from the original PDF. Our first step is to remove any lines that consist solely of a numeric page indicator or contain only copyright information.
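As a rough illustration of this first step, the sketch below drops copyright lines and digit-only lines that are followed by a blank line. The function name, the two patterns, and the assumption that copyright lines start with `©` are illustrative choices, not the notebook's actual implementation; `Verse text.` is a placeholder.

```python
import re

# Assumed patterns: a line holding only digits, and a line starting with the © sign.
RE_ONLY_DIGITS = re.compile(r'^\d+$')
RE_COPYRIGHT_LINE = re.compile(r'^©')

def drop_page_and_copyright_lines(text):
    """Remove copyright lines and digit-only lines that precede a blank line."""
    lines = text.split('\n')
    kept = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        # A page number is followed by a blank line (or the end of the text);
        # a verse number on its own line is followed directly by verse text.
        next_is_blank = i + 1 >= len(lines) or lines[i + 1].strip() == ''
        if RE_ONLY_DIGITS.match(stripped) and next_is_blank:
            continue
        if RE_COPYRIGHT_LINE.match(stripped):
            continue
        kept.append(line)
    return '\n'.join(kept)

sample = (
    "Maayong Balita nga Biblia\n"
    "© 2012 Philippine Bible Society.\n"
    "\n"
    "19\n"
    "\n"
    "12\n"
    "Verse text."
)
cleaned = drop_page_and_copyright_lines(sample)
print(cleaned)
```

The page number `19` and the copyright line are removed, while the `12` that sits directly above its verse text is left for a later step.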

Additionally, the document’s header periodically reappears in this form:

```plaintext
2
# **Hiligaynon – All Bible**

Maayong Balita nga Biblia

Maayong Balita nga Biblia
© 2012 Philippine Bible Society.

3

# **Hiligaynon – All Bible**

Maayong Balita nga Biblia
```

To eliminate these redundancies, we strip out every line beginning with a Markdown header (`#`) as well as any line consisting solely of a number. Once these artifacts are removed, only the core biblical text remains. For example:

```plaintext
*Genesis 1*

**Ang Pagtuga**

1 Sa ginsuguran gintuga sang Dios ang kalangitan kag ang duta.

2 Ang duta wala pa sing sulod kag porma. Ang mabalod nga kadagatan nga
nagatabon sang tanan nalikupan sang bug-os nga kadudulman kag ang
Espiritu sang Dios nagahulag sa ibabaw sang tubig.

3 Dayon ang Dios nagmando, “Magluntad ang kapawa!” Kag may kapawa.

4 Kag ang Dios nanamian sang iya nakita. Dayon ginpain niya ang kapawa sa
kadudulman.

5 Kag gintawag niya ang kapawa nga “Adlaw,” kag ang kadudulman nga “Gabi.”
Nagligad ang gab-i, kag nag-aga, amo yadto ang nahauna nga adlaw.
```
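The header-stripping step described above could be sketched with a single multi-line regular expression. The pattern name `RE_MD_HEADER` and the assumption that every header is one or more `#` characters followed by a space are illustrative, not taken from the notebook.

```python
import re

# Assumption: a Markdown header is a line starting with one or more '#' and a space.
# re.MULTILINE makes ^ match at the start of every line, not just the whole string.
RE_MD_HEADER = re.compile(r'^#+ .*\n?', re.MULTILINE)

sample = "2\n# **Hiligaynon – All Bible**\n\nMaayong Balita nga Biblia\n"
without_headers = RE_MD_HEADER.sub('', sample)
print(repr(without_headers))
```

Only the header line is removed here; the stray page number and the repeated title are left to the other cleanup steps.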

Fundamentally, the cleaned text follows this structure:

```plaintext
(
    **{BOOK_NAME} {CHAPTER_NUM}**

    **{CHAPTER_NAME}**

    (
        **{VERSE_NUM}** {VERSE_TEXT}
        ({VERSE_TEXT})*
    )+
)+

Our immediate goal is to remove all `BOOK*`, `CHAPTER*`, and `VERSE_NUM` markers, leaving only the verse text itself. After stripping those elements, the output should appear simply as:

```plaintext
{VERSE_TEXT}
{VERSE_TEXT}
...
```
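One possible way to get from the structured text to verse-only output is a small line-based parser. The sketch below assumes the bold markers shown in the structure above (`**...**` around headings and verse numbers); `extract_verses` and both patterns are hypothetical names introduced for illustration.

```python
import re

# Assumed markers: "**12** text" starts a verse; a fully bold line is a heading.
RE_VERSE_START = re.compile(r'^\*\*(\d+)\*\*\s*(.*)$')
RE_BOLD_ONLY_LINE = re.compile(r'^\*\*[^*\n]+\*\*$')

def extract_verses(text):
    """Return one string per verse, with headings and verse numbers removed."""
    verses = []
    for line in text.split('\n'):
        line = line.strip()
        if not line:
            continue
        match = RE_VERSE_START.match(line)
        if match:
            # Checked first, because a lone "**12**" also looks like a bold line.
            verses.append(match.group(2))
        elif RE_BOLD_ONLY_LINE.match(line):
            continue  # book/chapter heading or chapter name
        elif verses:
            # Continuation line: join it to the verse currently being built.
            verses[-1] = (verses[-1] + ' ' + line).strip()
    return verses

sample = (
    "**Genesis 1**\n\n"
    "**Ang Pagtuga**\n\n"
    "**1** Sa ginsuguran gintuga sang Dios ang kalangitan kag ang duta.\n\n"
    "**2** Ang duta wala pa sing sulod kag porma. Ang mabalod nga kadagatan nga\n"
    "nagatabon sang tanan nalikupan sang bug-os nga kadudulman kag ang\n"
    "Espiritu sang Dios nagahulag sa ibabaw sang tubig.\n"
)
verses = extract_verses(sample)
print('\n'.join(verses))
```

Joining continuation lines yields exactly one line per verse, which should make verse-level alignment across languages easier later on.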

A high-level view of the preprocessing pipeline:

```mermaid
flowchart TD
    A[Raw PDF → Plain Text]
    B[Remove page numbers & copyright lines]
    C[Strip Markdown headers]
    D[Eliminate BOOK & CHAPTER headings]
    E[Remove verse numbers]
    F[Cleaned verse-only text]

    A --> B --> C --> D --> E --> F
```

This streamlined text format will be instrumental when building a parallel corpus later on.
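The flowchart can also be read as an ordered list of substitutions applied one after another. The following is a minimal sketch under stated assumptions: every pattern in `STEPS`, the bold-marker convention, and the `run_pipeline` helper are illustrative stand-ins, and continuation lines are left untouched. The sample uses placeholder verse text.

```python
import re

# (label, pattern, replacement) in flowchart order; all patterns are assumptions.
STEPS = [
    ("page numbers", re.compile(r'^\d+\n\n', re.MULTILINE), ''),
    ("copyright lines", re.compile(r'^©.*\n?', re.MULTILINE), ''),
    ("markdown headers", re.compile(r'^#+ .*\n?', re.MULTILINE), ''),
    # A fully bold line that is not just a bold verse number.
    ("book/chapter headings",
     re.compile(r'^\*\*(?!\d+\*\*)[^*\n]+\*\*[ \t]*$\n?', re.MULTILINE), ''),
    ("verse numbers", re.compile(r'^\*\*\d+\*\*[ \t]*', re.MULTILINE), ''),
    ("blank lines", re.compile(r'\n{2,}'), '\n'),
]

def run_pipeline(text):
    """Apply each cleanup step in order and return the stripped result."""
    for _label, pattern, replacement in STEPS:
        text = pattern.sub(replacement, text)
    return text.strip()

sample = (
    "# **Hiligaynon – All Bible**\n\n"
    "**Genesis 1**\n\n"
    "**Ang Pagtuga**\n\n"
    "**1** First verse.\n\n"
    "3\n\n"
    "**2** Second verse.\n"
    "© 2012 Philippine Bible Society.\n"
)
result = run_pipeline(sample)
print(result)
```

Keeping the steps in a list makes the order explicit and lets individual steps be reordered or disabled while debugging.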


### Part A: Removing Page Numbers and Copyright Lines
Removing page numbers is simple: we target lines that contain only digits and are followed by a blank line, which have the form `^\d+\n\n`. Note that the second `\n` must not be made optional, since the pattern would then also match verses with the following structure:

```plaintext
{VERSE_NUM}
{VERSE_TEXT}
```
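A quick comparison shows the difference. This sketch assumes the pattern is applied with `re.MULTILINE` so that `^` matches at every line start; `RE_STRICT` and `RE_LOOSE` are names introduced here for illustration.

```python
import re

RE_STRICT = re.compile(r'^\d+\n\n', re.MULTILINE)   # digits followed by a blank line
RE_LOOSE = re.compile(r'^\d+\n\n?', re.MULTILINE)   # second newline made optional

# "19" is a page number; "3" is a verse number sitting above its verse text.
sample = "19\n\n3\nDayon ang Dios nagmando\n"

strict = RE_STRICT.sub('', sample)
loose = RE_LOOSE.sub('', sample)

print(repr(strict))
print(repr(loose))
```

The strict pattern removes only the page number, whereas the loose pattern also swallows the verse number `3`.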

In [None]:
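# re.MULTILINE lets ^ match at the start of every line, not only the start of the text.
RE_PAGE_NUM = re.compile(r'^\d+\n\n', re.MULTILINE)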
RE_COPYRIGHT = re.compile(r'