Conversion step 2: BOK to mARkdown

The BOK files are in fact MDB (Microsoft Access) database files. They contain many tables (for an overview, see here), of which the most important are:

main: contains the metadata
bxxxx (in which xxxx is the Shamela book id): contains the text of the book
txxxx (in which xxxx is the Shamela book id): contains the table of contents of the book
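As a small illustration of this naming convention, the following sketch picks the text and table-of-contents tables out of a list of table names. The function name find_book_tables and the sample table names are hypothetical; only the b/t prefix followed by the book id comes from the description above.

```python
import re


def find_book_tables(table_names):
    """Group bxxxx/txxxx table names by book id (hypothetical helper)."""
    books = {}
    for name in table_names:
        m = re.fullmatch(r"([bt])(\d+)", name, flags=re.IGNORECASE)
        if not m:
            continue  # e.g. "main" or any other auxiliary table
        kind = "text" if m.group(1).lower() == "b" else "toc"
        books.setdefault(m.group(2), {})[kind] = name
    return books


# Sample table names are made up for illustration
print(find_book_tables(["main", "b1234", "t1234"]))
```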

macOS

BOK > JSON:

The easiest way to access the BOK files on macOS is to read each BOK file and convert it to JSON format using the command-line app jetread.

_04_jetScript_all_tables.sh

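#!/bin/bash
# Export all tables from each BOK (MDB) file to JSON using jetread

srcFolder="2_bok"
trgFolder="3_json"

# -p: do not fail if the target folder already exists
mkdir -p "$trgFolder"

# stop if the source folder is missing
cd "./$srcFolder" || exit 1
for mdbfile in *.bok
do
    echo "$mdbfile"
    # quoting keeps file names with spaces intact
    ../jetread "$mdbfile" export -fmt json > "../$trgFolder/$mdbfile.json"
done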

JSON > mARkdown:

A Python script (_05_to_OpenITI_mARkdown.py) is then used to build the text files from the tables in the json file:

first, the entire dictionary is deNoised (i.e., short vowels, kashīdas, etc. are removed)
the script loops through all the rows in the bxxxx table and builds a new newBook dictionary (keys: page id, value: page content)
- gets the page and volume numbers of the page
- gets the text content of the page
- removes the footnotes (which are separated from the main text by a number of underscores ________) using the regex \n_+\n
the script then loops through the rows in the txxxx table and maps the table of contents to the pages in the newBook dictionary:
- using the tit (title text) and lvl (title level) fields from the txxxx table, it prefixes the titles in the text (in the newBook dictionary) with the relevant mARkdown tag (### + a number of pipes (|)) using a re.sub() operation
NB: an AUTO tag is appended to each inserted title tag to show that it was inserted automatically; this AUTO tag should later be removed in a manual check. If the automatic insertion failed (e.g., because the title is not mentioned in the text), the title is added at the top of the page and a CHECK tag is appended to it.
the script then extracts all metadata from the main table and converts it to a string (each line prepended with a #META# tag)
the text of the book is assembled by joining the pages in the newBook dictionary
paragraph tags (#) are prepended to every new line in the text; to make the files more readable in a reader like EditPadPro, line breaks (followed by ~~) are then introduced approximately every 72 characters
finally, the file is put together by concatenating the magic value (######OpenITI#), the metadata, and the text, and is saved to the mARkdown folder
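The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual _05_to_OpenITI_mARkdown.py script: the function names, the "id" key that links a table-of-contents entry to a page, the exact placement of the AUTO and CHECK tags, the metadata line format, and the choice to leave title lines unwrapped are all assumptions. The denoise function only strips Arabic short vowels and the kashīda (tatweel); the real deNoise step may remove more.

```python
import re
import textwrap

MAGIC_VALUE = "######OpenITI#"


def denoise(text):
    """Remove Arabic short vowels (U+064B to U+0652) and tatweel (U+0640)."""
    return re.sub(r"[\u064B-\u0652\u0640]", "", text)


def remove_footnotes(page_text):
    """Keep only the text before the first line of underscores."""
    return re.split(r"\n_+\n", page_text, maxsplit=1)[0]


def tag_title(page_text, tit, lvl):
    """Prefix the title with '### ' + lvl pipes; fall back to a CHECK tag."""
    tag = "### " + "|" * lvl
    new_text, n = re.subn(
        r"^" + re.escape(tit),
        lambda m: f"{tag} AUTO {m.group(0)}",
        page_text, count=1, flags=re.MULTILINE)
    if n == 0:
        # title not found in the page: put it at the top for manual checking
        new_text = f"{tag} CHECK {tit}\n{page_text}"
    return new_text


def format_paragraphs(text, width=72):
    """Add '# ' paragraph tags and break long lines with '~~'."""
    out = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # assumption: empty lines are dropped
        if line.startswith("###"):
            out.append(line)  # assumption: title lines are left as they are
        else:
            out.append("\n~~".join(textwrap.wrap("# " + line, width)))
    return "\n".join(out)


def format_metadata(meta):
    """One '#META#' line per metadata field (line format is an assumption)."""
    return "\n".join(f"#META# {key}: {val}" for key, val in meta.items())


def build_book(pages, toc, meta):
    """pages: {page id: text}; toc: list of dicts with id, tit and lvl."""
    new_book = {pid: remove_footnotes(denoise(t)) for pid, t in pages.items()}
    for entry in toc:
        pid = entry["id"]  # hypothetical key linking a title to its page
        if pid in new_book:
            new_book[pid] = tag_title(new_book[pid], entry["tit"], entry["lvl"])
    body = format_paragraphs("\n".join(new_book.values()))
    return "\n".join([MAGIC_VALUE, format_metadata(meta), body])
```

Pages are joined in the insertion order of the dictionary, so this sketch assumes that the rows of the bxxxx table were read in reading order.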

Windows

A Python script is used to convert the bok files directly:

the database in the bok files is accessed using the pypyodbc library
the metadata is extracted from the main table
the script loops through the bxxxx table and builds the text string:
- the text fragment is deNoised
- footnotes are removed and appended to an endnotes string, with the page number
- page numbers are added to the text fragment
- structural markup is added to the text fragment by a re.sub operation using the tit and lvl fields of the txxxx table
- the text fragment is added to the text string
the metadata and text are then joined and saved to file
NB: the pypyodbc library has trouble converting the Windows-1256 encoded strings in the bok file to utf-8; to deal with this encoding problem, the text and endnotes strings are written to a temp file, which is then converted to utf-8
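The re-encoding of the temp file can be sketched as below. This is only an illustration of the general technique: the function name convert_to_utf8 and the file paths are hypothetical, and it assumes the temp file contains plain Windows-1256 bytes (Python's cp1256 codec). How the actual script writes its temp file is not described here.

```python
from pathlib import Path


def convert_to_utf8(src_path, dst_path, src_encoding="cp1256"):
    """Read a Windows-1256 encoded text file and save it as UTF-8."""
    text = Path(src_path).read_text(encoding=src_encoding)
    Path(dst_path).write_text(text, encoding="utf-8")
```

A call such as convert_to_utf8("temp.txt", "book.txt") would leave the temp file untouched and write the converted text to a new file.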
