### Downloading the Dataset

In [1]:
import pymupdf4llm
import requests
from tqdm import tqdm

ModuleNotFoundError: No module named 'pymupdf4llm'
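
The error above suggests `pymupdf4llm` was not installed in the kernel's environment when this cell first ran. A minimal sketch for checking the three imports before continuing; `missing_packages` is a hypothetical helper, and the printed command assumes each import name matches its PyPI package name, which holds for these three:

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

required = ["pymupdf4llm", "requests", "tqdm"]
missing = missing_packages(required)
if missing:
    # Assumes import name == PyPI distribution name for every entry.
    print("Install before continuing: pip install " + " ".join(missing))
```

After installing, re-run the import cell so the modules are available to the later cells.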

In [6]:
urls = [
    "https://bibliamundi.com/wp-content/uploads/2023/09/Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible.pdf",
    "https://bibliamundi.com/wp-content/uploads/2023/09/Cebuano-All-Bible.pdf",
    "https://bibliamundi.com/wp-content/uploads/2023/09/Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf"
]

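for url in urls:
    filename = url.split("/")[-1]
    print(f"Downloading {filename}...")
    # Stream the response so a large PDF is never held fully in memory,
    # and set a timeout so a stalled connection cannot hang the cell.
    response = requests.get(url, stream=True, timeout=60)
    if response.status_code == 200:
        with open(filename, "wb") as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        print(f"Saved to {filename}")
    else:
        print(f"Failed to download {url} (HTTP {response.status_code})")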

Downloading Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible.pdf...
Saved to Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible.pdf
Downloading Cebuano-All-Bible.pdf...
Saved to Cebuano-All-Bible.pdf
Downloading Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf...
Saved to Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf
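
A 200 status does not guarantee that the saved body is a PDF, since a server can return an HTML error page with a success code. As a rough sanity check, one could confirm that each file starts with the `%PDF-` header. `looks_like_pdf` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Check whichever of the downloaded files are present in the working directory.
for name in [
    "Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible.pdf",
    "Cebuano-All-Bible.pdf",
    "Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf",
]:
    if Path(name).exists():
        print(name, "ok" if looks_like_pdf(name) else "not a PDF")
```

This only inspects the first five bytes; it does not detect a truncated download.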


### Converting PDFs to Text Files

In [5]:
bible_files = ["Cebuano-All-Bible","Hiligaynon-Ang-Pulong-Sang-Dios-All-Bible",
               "Hiligaynon-Maayong-Balita-nga-Biblia-All-Bible"]

for file in tqdm(bible_files):
    md_text = pymupdf4llm.to_markdown(f"{file}.pdf", show_progress=True)
    # Write as UTF-8 explicitly so non-ASCII characters survive on any platform.
    with open(f"{file}.md", "w", encoding="utf-8") as f:
        f.write(md_text)

  0%|                                                                                                                         | 0/3 [00:00<?, ?it/s]

Processing Cebuano-All-Bible.pdf...


 33%|█████████████████████████████████████▎                                                                          | 1/3 [02:46<05:32, 166.14s/it]

Processing Hiligaynon-Ang-Pulong-Sang-Dios-All-Bible.pdf...


 67%|██████████████████████████████████████████████████████████████████████████▋                                     | 2/3 [04:53<02:23, 143.59s/it]

Processing Filipino-Magandang-Balita-Biblia-Revised-All-Bible.pdf...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [07:43<00:00, 154.39s/it]
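
The file stems in the conversion cell are typed by hand, so they can drift from the download list. One possible way to keep the two in sync is to derive the stems from the URLs. This is a sketch; `stems_from_urls` and `derived_stems` are hypothetical names, and only one of the notebook's URLs is repeated here for brevity:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def stems_from_urls(urls):
    """Return each URL's final path component without its file extension."""
    return [PurePosixPath(urlparse(u).path).stem for u in urls]

example_urls = [
    "https://bibliamundi.com/wp-content/uploads/2023/09/Cebuano-All-Bible.pdf",
]
derived_stems = stems_from_urls(example_urls)
print(derived_stems)
```

Passing the full `urls` list from the download cell would give a list that can be iterated in place of the hand-written one.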


## First Look
The Bible datasets used in this notebook were primarily sourced from the **Philippine Bible Society**, the leading institution responsible for translating and publishing Scripture in various Philippine languages. All selected versions were obtained from [Bibliamundi](https://bibliamundi.com), a digital platform dedicated to aggregating and freely distributing multilingual editions of the Bible in accessible formats such as PDF.

### Hiligaynon: Maayong Balita nga Biblia (en: The Good News Bible)
The Maayong Balita nga Biblia (2012) is a dynamic-equivalence translation of the Bible into Hiligaynon, designed to communicate the message of Scripture in a clear and natural style. It was translated directly from the original Hebrew and Greek texts by the Philippine Bible Society, using widely accepted scholarly sources such as the Masoretic Text and the UBS Greek New Testament.

This edition adopts an informal yet faithful tone to make the text accessible to contemporary readers. It primarily serves the approximately 7 million Hiligaynon speakers living in Western Visayas and surrounding regions in the Philippines.

### Cebuano: Ang Bag-ong Maayong Balita (en: The New Good News Bible)
The Ang Bag-ong Maayong Balita Biblia is a Cebuano translation of the Bible, published by the Philippine Bible Society in 1999. Like the Hiligaynon edition, it adopts a dynamic-equivalence philosophy, which focuses on conveying the meaning of the original texts rather than translating word for word. It also draws on the Masoretic Text for the Old Testament and the UBS Greek New Testament.

Cebuano is spoken by 20 m