Conversion step 2: BOK to mARkdown

pverkind edited this page Mar 14, 2023 · 3 revisions

The BOK files are in fact Microsoft Access (mdb) database files. They contain many tables (for an overview, see here), of which the most important are:

  • main: contains the metadata
  • bxxxx (in which xxxx is the shamela book id): contains the text of the book
  • txxxx (in which xxxx is the shamela book id): contains the table of contents of the book
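
As a small illustration of this naming scheme, the sketch below picks the three tables out of a list of table names. The function name find_book_tables and the sample table names are hypothetical; the list of names would have to come from whatever tool is used to open the database.

```python
import re


def find_book_tables(table_names):
    """Pick out the main, bxxxx and txxxx tables from a list of table names."""
    found = {"main": None, "text": None, "toc": None}
    for name in table_names:
        if name.lower() == "main":
            # metadata table
            found["main"] = name
        elif re.fullmatch(r"b\d+", name):
            # "b" + shamela book id: text of the book
            found["text"] = name
        elif re.fullmatch(r"t\d+", name):
            # "t" + shamela book id: table of contents
            found["toc"] = name
    return found
```

Any table that does not match one of the three patterns is simply ignored.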

macOS

BOK > JSON:

The easiest way to access the bok files on macOS is to read each bok file and convert it to json format using the command-line app jetread.

_04_jetScript_all_tables.sh

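#!/bin/bash
# Extract all tables from each bok (MDB) file and save them as JSON.

srcFolder="2_bok"
trgFolder="3_json"

# -p: do not fail if the target folder already exists
mkdir -p "$trgFolder"

# stop if the source folder is missing
cd "./$srcFolder" || exit 1
for mdbfile in *.bok
do
    echo "$mdbfile"
    # quotes protect file names that contain spaces
    ../jetread "$mdbfile" export -fmt json > "../$trgFolder/$mdbfile.json"
done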

JSON > mARkdown:

Use the shamela_converter from the openiti Python library to convert the json file into mARkdown:

from openiti.new_books.convert.shamela_converter import BokJsonConverter

conv = BokJsonConverter()
conv.convert_files_in_folder("./3_json", dest_folder="./4_openITImARkdown")

Windows

Using Windows Subsystem for Linux

If WSL is installed, the easiest way to convert the bok files is to convert them to json format first, and then use a Python script to build the text files from these json files.

bok to json:

Use this bash script, which requires the mdbtools and miller packages (to install them, run sudo apt-get install mdbtools miller).

Usage: bash ./_03_bok2json.sh ./2_bok ./3_json
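
The bash script itself is not reproduced on this page. As a rough illustration of the same idea (export a table with mdbtools, then turn the CSV into JSON), here is a minimal Python sketch. It assumes the mdb-export command from mdbtools is on the PATH and that its output is UTF-8; the function names are hypothetical and the real script may work differently.

```python
import csv
import io
import json
import subprocess


def csv_to_records(csv_text):
    """Turn CSV text (first row = column names) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))


def export_table_to_json(bok_path, table, json_path):
    """Export one table of a bok file to a JSON file (needs mdbtools)."""
    # mdb-export prints the given table as CSV on stdout
    result = subprocess.run(
        ["mdb-export", bok_path, table],
        check=True, capture_output=True, encoding="utf-8",
    )
    records = csv_to_records(result.stdout)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)
```

All values come out of the CSV as strings, so numeric fields would still need converting before further processing.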

json to mARkdown:

Use the shamela_converter from the openiti Python library to convert the json file into mARkdown:

from openiti.new_books.convert.shamela_converter import BokJsonConverter

conv = BokJsonConverter()
conv.convert_files_in_folder("./3_json", dest_folder="./4_openITImARkdown")

Without WSL:

A Python script is used to convert the bok files directly (NB: this method seems to create problems on Windows 11):

  • the database in the bok files is accessed using the pypyodbc library
  • the metadata is extracted from the main table
  • the script loops through the bxxxx table and builds the text string:
    • the text fragment is deNoised
    • footnotes are removed and appended to an endnotes string, with the page number
    • page numbers are added to the text fragment
    • structural markup is added to the text fragment by a re.sub operation using the tit and lvl fields of the txxxx table
    • the text fragment is added to the text string
  • the metadata and text are then joined and saved to file
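
The loop described above can be sketched roughly as follows. This is an illustration, not the actual script: the deNoise step is left out, and the field names nass, page, part and id, the underscore line separating footnotes from the body, and the PageV01P001 and "### |" tag formats are all assumptions made for the example. Only the tit and lvl fields are named on this page.

```python
import re

# Assumed: footnotes follow a line made of underscores.
FOOTNOTE_SEP = re.compile(r"\n_{5,}\n")


def build_text(book_rows, toc_rows):
    """Join the rows of a bxxxx table into a text string and an endnotes string."""
    # Group the txxxx entries (title, level) by the bxxxx row they point to.
    titles = {}
    for toc in toc_rows:
        titles.setdefault(toc["id"], []).append((toc["tit"], toc["lvl"]))

    text_parts, endnotes = [], []
    for row in book_rows:
        page_tag = "PageV{:02d}P{:03d}".format(row["part"], row["page"])
        # Split the footnotes off the fragment and keep them with the page number.
        body, *notes = FOOTNOTE_SEP.split(row["nass"], maxsplit=1)
        if notes:
            endnotes.append(page_tag + ":\n" + notes[0].strip())
        # Add structural markup: tag the first line-initial match of each title.
        for tit, lvl in titles.get(row["id"], []):
            header = "### " + "|" * lvl + " " + tit
            body = re.sub(r"^\s*" + re.escape(tit), lambda m, h=header: h,
                          body, count=1, flags=re.M)
        # Add the page number to the end of the fragment.
        text_parts.append(body.strip() + " " + page_tag)
    return "\n".join(text_parts), "\n".join(endnotes)
```

The sketch assumes part and page are integers; rows with missing values would need handling before formatting the page tag.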

NB: the pypyodbc library has trouble converting the Windows-1256 encoded strings in the bok file to UTF-8. To work around this encoding problem, the text and endnotes strings are written to a temp file, which is then converted to UTF-8.
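
A minimal sketch of the re-encoding step, assuming the temp file holds raw Windows-1256 bytes (Python's cp1256 codec); the function name reencode_to_utf8 is hypothetical and the actual script may handle this differently.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, src_encoding="cp1256"):
    """Read a file in the given encoding and write it back out as UTF-8."""
    raw = Path(src).read_bytes()
    # decode the Windows-1256 bytes into a Python (Unicode) string
    text = raw.decode(src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

Reading the file as bytes and decoding explicitly avoids relying on the system's default encoding, which differs between Windows installations.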
