Conversion step 2: BOK to mARkdown

The BOK files are in fact MDB (Microsoft Access) database files. Each file contains many tables (for an overview, see here), of which the most important are:

main: contains the metadata
bxxxx (where xxxx is the Shamela book id): contains the text of the book
txxxx (where xxxx is the Shamela book id): contains the table of contents of the book
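As a small illustration of this naming scheme, the sketch below sorts a list of table names into the three groups described above. It assumes that book ids consist of digits only and that table names may differ in letter case; the function name classify_tables is hypothetical and not part of any library.

```python
import re

def classify_tables(table_names):
    """Group BOK table names into metadata, text and table-of-contents tables."""
    groups = {"main": [], "text": [], "toc": [], "other": []}
    for name in table_names:
        lower = name.lower()
        if lower == "main":
            groups["main"].append(name)      # metadata table
        elif re.fullmatch(r"b\d+", lower):
            groups["text"].append(name)      # bxxxx: text of the book
        elif re.fullmatch(r"t\d+", lower):
            groups["toc"].append(name)       # txxxx: table of contents
        else:
            groups["other"].append(name)     # any other table
    return groups
```

For example, classify_tables(["Main", "b1234", "t1234"]) places b1234 under "text" and t1234 under "toc".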

macOS

BOK > JSON:

The easiest way to access the BOK files on macOS is to convert them to JSON using the command-line app jetread. The script below expects the jetread executable in the parent folder of the BOK folder:

_04_jetScript_all_tables.sh

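#!/bin/bash
# Export all tables of every BOK (MDB) file in srcFolder to JSON with jetread.

srcFolder="2_bok"
trgFolder="3_json"

# -p: do not fail if the target folder already exists
mkdir -p "$trgFolder"

# stop if the source folder is missing
cd "./$srcFolder" || exit 1
for mdbfile in *.bok
do
    echo "$mdbfile"
    # quoting keeps file names with spaces intact
    ../jetread "$mdbfile" export -fmt json > "../$trgFolder/$mdbfile.json"
done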
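Because the script redirects jetread's output into a file, a failed export still leaves an empty JSON file behind. The following sketch lists the BOK files that have no JSON output or an empty one. It assumes the folder names and the "<name>.bok.json" naming used in the script above; the function name find_missing_json is hypothetical.

```python
from pathlib import Path

def find_missing_json(src_folder="2_bok", trg_folder="3_json"):
    """Return the names of BOK files without a non-empty JSON counterpart."""
    src, trg = Path(src_folder), Path(trg_folder)
    missing = []
    for bok in sorted(src.glob("*.bok")):
        # the shell script writes <name>.bok.json into the target folder
        out = trg / (bok.name + ".json")
        if not out.is_file() or out.stat().st_size == 0:
            missing.append(bok.name)
    return missing

if __name__ == "__main__":
    for name in find_missing_json():
        print("no JSON output for", name)
```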

JSON > mARkdown:

Use the shamela_converter module of the openiti Python library to convert the JSON files into mARkdown:

from openiti.new_books.convert.shamela_converter import BokJsonConverter

conv = BokJsonConverter()
conv.convert_files_in_folder("./3_json", dest_folder="./4_openITImARkdown")

Windows

Using Windows Subsystem for Linux

If WSL is installed, the easiest approach is to convert the BOK files to JSON first, and then use a Python script to build the text files from these JSON files.

bok to json:

Use this bash script. It requires the mdbtools and miller packages; to install them, run sudo apt-get install mdbtools miller

Usage: bash ./_03_bok2json.sh ./2_bok ./3_json
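For readers who want to see the general idea behind such an export, here is a rough Python sketch. It is not the linked bash script: it calls the mdbtools commands mdb-tables and mdb-export and parses their CSV output with Python's csv module instead of miller. The resulting JSON layout (one key per table, each holding a list of row dictionaries) is an assumption and may not match what BokJsonConverter expects; the function names are hypothetical.

```python
import csv
import io
import json
import subprocess

def list_tables(bok_path):
    """List the table names of an MDB file, one per line (mdb-tables -1)."""
    result = subprocess.run(["mdb-tables", "-1", bok_path],
                            capture_output=True, encoding="utf-8", check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]

def csv_to_records(csv_text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    # newline="" keeps line breaks inside quoted fields untouched
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))

def bok_to_json(bok_path, json_path):
    """Export every table of a BOK file into a single JSON file (assumed layout)."""
    data = {}
    for table in list_tables(bok_path):
        result = subprocess.run(["mdb-export", bok_path, table],
                                capture_output=True, encoding="utf-8", check=True)
        data[table] = csv_to_records(result.stdout)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```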

json to mARkdown:

Use the shamela_converter module of the openiti Python library to convert the JSON files into mARkdown:

from openiti.new_books.convert.shamela_converter import BokJsonConverter

conv = BokJsonConverter()
conv.convert_files_in_folder("./3_json", dest_folder="./4_openITImARkdown")

Without WSL:

A Python script converts the BOK files directly (NB: this method seems to cause problems on Windows 11). The script works as follows:

the database in the BOK file is accessed using the pypyodbc library
the metadata is extracted from the main table
the script loops through the rows of the bxxxx table and builds the text string:
- the text fragment is deNoised
- footnotes are removed from the fragment and appended, together with the page number, to an endnotes string
- the page number is added to the text fragment
- structural markup is added to the text fragment by a re.sub operation that uses the tit and lvl fields of the txxxx table
- the text fragment is appended to the text string
the metadata and text are then joined and saved to file
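The loop over the bxxxx rows can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: the row keys id, vol, page and text, the footnote separator, and the way table-of-contents entries are matched to rows are all hypothetical; only tit and lvl are field names mentioned above. The deNoise step and the database access are left out. Page numbers and headings follow the PageV##P### and "### |" patterns of OpenITI mARkdown.

```python
import re

# Hypothetical separator between main text and footnotes; the real marker
# in the BOK text field may differ.
FOOTNOTE_SEP = "\n_________\n"

def page_tag(vol, page):
    """Format a page number in the PageV##P### style."""
    return f"PageV{vol:02d}P{page:03d}"

def add_heading(fragment, title, level):
    """Mark the first occurrence of a title as a heading of the given level."""
    marker = "### " + "|" * level + " "
    return re.sub(re.escape(title),
                  lambda m: "\n" + marker + m.group(0) + "\n",
                  fragment, count=1)

def build_text(rows, toc):
    """Build the text and endnotes strings from text rows and TOC entries.

    rows: list of dicts with the (assumed) keys id, vol, page, text
    toc:  dict mapping a row id to a list of dicts with keys tit and lvl
    """
    parts, endnotes = [], []
    for row in rows:
        # split off the footnotes and store them with their page number
        body, _, notes = row["text"].partition(FOOTNOTE_SEP)
        tag = page_tag(row["vol"], row["page"])
        if notes.strip():
            endnotes.append(f"{tag}: {notes.strip()}")
        # add structural markup for every TOC entry pointing at this row
        for entry in toc.get(row["id"], []):
            body = add_heading(body, entry["tit"], entry["lvl"])
        # close the fragment with its page number
        parts.append(body.strip() + " " + tag)
    return "\n".join(parts), "\n".join(endnotes)
```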

NB: the pypyodbc library has trouble converting the Windows-1256 encoded strings in the BOK file to UTF-8. To work around this encoding problem, the text and endnotes strings are first written to a temp file, which is then converted to UTF-8.
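The re-encoding of the temp file can be sketched like this. It is a minimal example that assumes the temp file contains only Windows-1256 bytes (Python codec name cp1256); the function name convert_to_utf8 and the file names in the usage line are hypothetical.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1256"):
    """Read a file in the source encoding and write it back as UTF-8."""
    # newline="" keeps the original line endings unchanged
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        content = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(content)
```

Usage: convert_to_utf8("temp.txt", "book.txt") rewrites the Windows-1256 temp file as a UTF-8 text file.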
