Conversion step 2: BOK to mARkdown
The BOK files are in fact mdb database files. They contain many tables (for an overview, see here), of which the most important are:
- main: contains the metadata
- bxxxx (in which xxxx is the shamela book id): contains the text of the book
- txxxx (in which xxxx is the shamela book id): contains the table of contents of the book
The easiest way to access the bok files on macOS is to convert each bok file to JSON format using the command-line app jetread.
_04_jetScript_all_tables.sh
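#!/bin/bash
# Extract all tables from each .bok (MDB) file and save them as JSON.
srcFolder="2_bok"
trgFolder="3_json"
# -p: do not fail if the target folder already exists
mkdir -p "$trgFolder"
# stop here if the source folder is missing
cd "./$srcFolder" || exit 1
for mdbfile in *.bok
do
    echo "$mdbfile"
    # quoting keeps file names with spaces intact
    ../jetread "$mdbfile" export -fmt json > "../$trgFolder/$mdbfile.json"
done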
Use the shamela_converter from the openiti Python library to convert the json file into mARkdown:
from openiti.new_books.convert.shamela_converter import BokJsonConverter
conv = BokJsonConverter()
conv.convert_files_in_folder("./3_json", dest_folder="./4_openITImARkdown")
If WSL is installed, the easiest way to convert the bok files is to convert them to json format first, and then use a Python script to build the text files from these json files.
Use this bash script (requires the mdbtools and miller packages; to install them, run sudo apt-get install mdbtools miller).
Usage: bash ./_03_bok2json.sh ./2_bok ./3_json
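The bash script itself is not reproduced on this page. As a rough illustration of the same idea, the sketch below calls the mdbtools commands `mdb-tables` and `mdb-export` from Python and parses their CSV output with the standard library instead of miller. The function names and the JSON layout (one object keyed by table name) are assumptions made for this example; they may not match what `_03_bok2json.sh` produces or what the converter expects.

```python
import csv
import io
import json
import subprocess
from pathlib import Path


def csv_to_records(csv_text):
    """Parse CSV text (as printed by mdb-export) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def list_tables(bok_path):
    """Return the table names in a bok file (mdb-tables -1 prints one per line)."""
    result = subprocess.run(
        ["mdb-tables", "-1", str(bok_path)],
        capture_output=True, text=True, check=True,
    )
    return [name for name in result.stdout.splitlines() if name.strip()]


def bok_to_json(bok_path, json_path):
    """Dump every table of one bok file into a single JSON file.

    Assumed layout (hypothetical): {"table_name": [row_dict, ...], ...}
    """
    data = {}
    for table in list_tables(bok_path):
        result = subprocess.run(
            ["mdb-export", str(bok_path), table],
            capture_output=True, text=True, check=True,
        )
        data[table] = csv_to_records(result.stdout)
    Path(json_path).write_text(
        json.dumps(data, ensure_ascii=False), encoding="utf-8"
    )
```

Only `csv_to_records` runs without mdbtools installed; the other two functions need the `mdb-tables` and `mdb-export` binaries on the PATH.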
Use the shamela_converter from the openiti Python library to convert the json file into mARkdown:
from openiti.new_books.convert.shamela_converter import BokJsonConverter
conv = BokJsonConverter()
conv.convert_files_in_folder("./3_json", dest_folder="./4_openITImARkdown")
A Python script is used to convert the bok files directly (NB: this method seems to create problems on Windows 11):
- the database in the bok files is accessed using the pypyodbc library
- the metadata is extracted from the main table
- the script loops through the bxxxx table and builds the text string:
  - the text fragment is deNoised
  - footnotes are removed and appended to an endnotes string, with the page number
  - page numbers are added to the text fragment
  - structural markup is added to the text fragment by a re.sub operation using the tit and lvl fields of the txxxx table
  - the text fragment is added to the text string
- the metadata and text are then joined and saved to file
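The loop described above can be sketched roughly as follows, working on rows that have already been read into Python dictionaries. This is not the actual conversion script. Only the `tit` and `lvl` field names come from the description above; the `id`, `nass`, `part` and `page` fields, the underscore footnote separator, and the minimal `denoise` step are assumptions made for illustration.

```python
import re

# Assumption: a run of underscores separates the main text from the footnotes.
FOOTNOTE_SEPARATOR = "_________"


def denoise(text):
    """Placeholder cleaning step: normalise line endings and trim whitespace."""
    return text.replace("\r\n", "\n").replace("\r", "\n").strip()


def split_footnotes(fragment):
    """Split a fragment into (main text, footnotes) at the first separator."""
    main, sep, notes = fragment.partition(FOOTNOTE_SEPARATOR)
    notes = notes.lstrip("_").strip() if sep else ""
    return main.strip(), notes


def add_headings(fragment, toc_rows):
    """Mark each title that fills a line with a heading tag (### |, ### ||, ...)."""
    for row in toc_rows:
        title = row["tit"]
        marker = "### " + "|" * int(row["lvl"]) + " "
        # The title may be followed by a page marker before the end of the line.
        pattern = (r"^[ \t]*" + re.escape(title)
                   + r"(?=[ \t]*(?:PageV\d+P\d+)?[ \t]*$)")
        # A function replacement avoids backslash handling in the title.
        fragment = re.sub(pattern, lambda m: marker + title, fragment,
                          count=1, flags=re.MULTILINE)
    return fragment


def build_text(text_rows, toc_rows):
    """Return (text, endnotes) built from bxxxx-style and txxxx-style rows."""
    toc_by_id = {}
    for row in toc_rows:
        toc_by_id.setdefault(row["id"], []).append(row)
    text, endnotes = "", ""
    for row in text_rows:
        fragment = denoise(row["nass"])
        fragment, notes = split_footnotes(fragment)
        page = "PageV{:02d}P{:03d}".format(int(row["part"]), int(row["page"]))
        if notes:
            endnotes += "{}: {}\n".format(page, notes)
        fragment += " " + page
        fragment = add_headings(fragment, toc_by_id.get(row["id"], []))
        text += fragment + "\n"
    return text, endnotes
```

Keeping the database access out of this function means the text-building logic can be tried on a few hand-written rows before it is pointed at a real bok file.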
NB: the pypyodbc library has trouble converting the Windows-1256 encoded strings in the bok file to UTF-8. To work around this encoding problem, the text and endnotes strings are written to a temp file, which is then converted to UTF-8.
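How the temp file is converted is not spelled out here. One possible way to do the final step, assuming the temp file contains Windows-1256 bytes, is to decode it with Python's built-in `cp1256` codec and write the result back as UTF-8. The function name and parameters below are hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(src_path, dest_path, src_encoding="cp1256"):
    """Decode a file from src_encoding and write the same text as UTF-8."""
    text = Path(src_path).read_text(encoding=src_encoding)
    Path(dest_path).write_text(text, encoding="utf-8")
    return text
```

For example, `reencode_to_utf8("temp.txt", "book.txt")` would read `temp.txt` as Windows-1256 and save `book.txt` as UTF-8; both file names are placeholders.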