Conversion step 2: BOK to mARkdown
The BOK files are in fact mdb database files. They contain many tables (for an overview, see here), of which the most important are:
- `main`: contains the metadata
- `bxxxx` (in which xxxx is the shamela book id): contains the text of the book
- `txxxx` (in which xxxx is the shamela book id): contains the table of contents of the book
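As a rough illustration of the naming scheme above, the three tables can be picked out of a list of table names with a small regular expression. This is only a sketch: the helper name `classify_tables` and its return layout are hypothetical, and it assumes that the shamela book id consists of digits only and that table names may differ in case.

```python
import re

# Hypothetical helper: the name and the return layout are illustrative only.
# Assumption: the shamela book id (the "xxxx" part) consists of digits only.
TABLE_PATTERN = re.compile(r"([bt])(\d+)", re.IGNORECASE)

def classify_tables(table_names):
    """Pick out the metadata, text and table-of-contents tables by name."""
    found = {"main": None, "text": None, "toc": None}
    for name in table_names:
        if name.lower() == "main":
            found["main"] = name
            continue
        match = TABLE_PATTERN.fullmatch(name)
        if match:
            # "b" + id holds the book text, "t" + id the table of contents
            key = "text" if match.group(1).lower() == "b" else "toc"
            found[key] = name
    return found
```

Any other table in the file is simply ignored by this helper.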
The easiest way to access the bok files on macOS is to read each bok file and convert it to JSON format using the command-line app `jetread`.
_04_jetScript_all_tables.sh
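#!/bin/bash
# Export all tables of every .bok (MDB) file in 2_bok to a JSON file in 3_json.
srcFolder="2_bok"
trgFolder="3_json"
# -p: do not fail if the target folder already exists
mkdir -p "$trgFolder"
# stop if the source folder is missing, so jetread is never run in the wrong place
cd "./$srcFolder" || exit 1
for mdbfile in *.bok
do
    echo "$mdbfile"
    # quoting keeps file names with spaces intact
    ../jetread "$mdbfile" export -fmt json > "../$trgFolder/$mdbfile.json"
done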
A Python script (_05_to_OpenITI_mARkdown.py) is then used to build the text files from the tables in the json file:
- first, the entire dictionary is deNoised (i.e., short vowels, kashīdas, etc. are removed)
- the script loops through all the rows in the `bxxxx` table and builds a new `newBook` dictionary (keys: page id, value: page content):
  - gets the page and volume numbers of the page
  - gets the text content of the page
  - removes the footnotes (which are separated from the main text by a number of underscores ________) using the regex `\n_+\n`
- the script then loops through the rows in the `txxxx` table and maps the table of contents to the pages in the `newBook` dictionary:
  - using the `tit` (title text) and `lvl` (title level) fields from the `txxxx` table, it prefixes the titles in the text (in the `newBook` dictionary) with the relevant mARkdown tag (`###` + a number of pipes (`|`)) using a `re.sub()` operation

  NB: the tags are appended with an `AUTO` tag to show that these title tags were inserted automatically; this `AUTO` tag should later be removed in a manual check. If the automatic insertion failed (e.g., because the title is not mentioned in the text), the title is added at the top of the page and appended with a `CHECK` tag.
- the script then extracts all metadata from the `main` table and converts it to a string (each line prepended with a `#META#` tag)
- the book is assembled by joining the pages in the `newBook` dictionary
- paragraph tags (`#`) are prepended to every new line in the text; to make the files more readable in a reader like EditPadPro, line breaks (followed by `~~`) are then introduced after every ca. 72 characters
- finally, the book is assembled by adding together the magicValue `######OpenITI#`, the metadata, and the text, and the file is saved to the mARkdown folder
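The text-building steps above can be sketched roughly as follows. This is a minimal sketch, not the code of `_05_to_OpenITI_mARkdown.py`: all function names are hypothetical, and the exact spacing of the header tag, the position of the `AUTO` and `CHECK` tags, the `#META# key: value` layout, and the sorting of pages by page id are assumptions made for illustration.

```python
import re
import textwrap

MAGIC_VALUE = "######OpenITI#"

def strip_footnotes(page):
    # Keep only the text before the first line consisting solely of underscores.
    return re.split(r"\n_+\n", page, maxsplit=1)[0]

def tag_title(page, title, level):
    # Assumed tag layout: "### " + one pipe per title level, then AUTO or CHECK.
    tag = "### " + "|" * level
    tagged, count = re.subn(
        re.escape(title),
        lambda m: "\n" + tag + " AUTO " + m.group(0) + "\n",
        page,
        count=1,
    )
    if count == 0:
        # Title not found on the page: put it at the top and flag it for checking.
        tagged = tag + " CHECK " + title + "\n" + page
    return tagged

def wrap_paragraphs(text, width=72):
    out = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Lines that already carry a tag (titles, metadata) are left as they are.
        if not line.startswith("#"):
            line = "# " + line
        chunks = textwrap.wrap(
            line, width=width, break_long_words=False, break_on_hyphens=False
        )
        # Continuation lines start with "~~".
        out.append("\n~~".join(chunks))
    return "\n".join(out)

def assemble(metadata, pages):
    # metadata: dict of field name -> value; pages: dict of page id -> page text
    meta = "\n".join("#META# {}: {}".format(k, v) for k, v in metadata.items())
    body = wrap_paragraphs("\n".join(pages[k] for k in sorted(pages)))
    return MAGIC_VALUE + "\n" + meta + "\n\n" + body
```

Passing the replacement to `re.subn` as a function rather than a string avoids problems with backslashes in the title text, and `re.escape` keeps characters such as brackets in a title from being read as regex syntax.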
A Python script is used to convert the bok files directly:
- the database in the bok files is accessed using the `pypyodbc` library
- the metadata is extracted from the `main` table
- the script loops through the `bxxxx` table and builds the `text` string:
  - the text fragment is deNoised
  - footnotes are removed and appended to an `endnotes` string, with the page number
  - page numbers are added to the text fragment
  - structural markup is added to the text fragment by a `re.sub` operation using the `tit` and `lvl` fields of the `txxxx` table
  - the text fragment is added to the `text` string
- the metadata and text are then joined and saved to file
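The loop over the text fragments might look something like the sketch below. It leaves out the database access and the structural markup, and it rests on several assumptions that are not confirmed by the script itself: the noise set (Arabic short vowels and related marks, U+064B to U+0652, plus the kashīda, U+0640), the underscore line as footnote separator, the `PageV01P005`-style page number format, and all function names, which are hypothetical.

```python
import re

# Assumed noise set: short vowels and related marks (U+064B-U+0652)
# plus the kashida / tatweel (U+0640).
NOISE = re.compile("[\u0640\u064B-\u0652]")
# Assumed footnote separator: a line consisting only of underscores.
FOOTNOTE_SEPARATOR = re.compile(r"\n_+\n")

def denoise(fragment):
    return NOISE.sub("", fragment)

def process_fragment(fragment, volume, page):
    """Return (main text with page number, endnote entry or empty string)."""
    # Assumed page number format, e.g. PageV01P005
    page_tag = "PageV{:02d}P{:03d}".format(volume, page)
    parts = FOOTNOTE_SEPARATOR.split(denoise(fragment), maxsplit=1)
    main = parts[0].strip() + " " + page_tag
    endnote = (page_tag + "\n" + parts[1].strip()) if len(parts) == 2 else ""
    return main, endnote

def build_strings(rows):
    """rows: iterable of (fragment, volume, page) tuples in reading order."""
    text, endnotes = [], []
    for fragment, volume, page in rows:
        main, endnote = process_fragment(fragment, volume, page)
        text.append(main)
        if endnote:
            endnotes.append(endnote)
    return "\n".join(text), "\n\n".join(endnotes)
```

Collecting the fragments in lists and joining them once at the end avoids repeated string concatenation on long books.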
NB: the `pypyodbc` library has trouble converting the Windows-1256 formatted strings in the bok file to utf-8; in order to deal with the encoding problem, the `text` and `endnotes` strings are written to a temp file, which is then converted to utf-8
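A temp-file round trip of this kind could be sketched as follows. This is an illustration only: it assumes the strings are available as raw Windows-1256 bytes, the function names are hypothetical, and `cp1256` is Python's codec name for Windows-1256.

```python
import os
import tempfile

def convert_cp1256_to_utf8(src_path, dst_path):
    """Re-encode a Windows-1256 text file as UTF-8."""
    # newline="" keeps the original line endings untouched
    with open(src_path, "r", encoding="cp1256", newline="") as src:
        content = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(content)

def save_as_utf8(raw_bytes, dst_path):
    """Write raw Windows-1256 bytes to a temp file, then convert that file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(raw_bytes)
    try:
        convert_cp1256_to_utf8(tmp.name, dst_path)
    finally:
        # Remove the temp file even if the conversion fails.
        os.remove(tmp.name)
```

The temp file is created with `delete=False` and closed before it is reopened, because a file that is still open cannot be opened a second time on Windows.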