Conversion step 2: BOK to mARkdown

pverkind edited this page Oct 22, 2019 · 3 revisions

The BOK files are in fact MDB (Microsoft Access) database files. They contain many tables (for an overview, see here), of which the most important are:

  • main: contains the metadata
  • bxxxx (in which xxxx is the Shamela book id): contains the text of the book
  • txxxx (in which xxxx is the Shamela book id): contains the table of contents of the book

MacOS

BOK > JSON:

The easiest way to access the BOK files in MacOS is to convert each file to JSON format using the command-line app jetread.

_04_jetScript_all_tables.sh

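#!/bin/bash
# Export all tables of every BOK (MDB) file to JSON using jetread.

srcFolder="2_bok"
trgFolder="3_json"

# -p: do not fail if the target folder already exists
mkdir -p "$trgFolder"

# stop if the source folder is missing, instead of looping in the wrong place
cd "./$srcFolder" || exit 1
for mdbfile in *.bok
do
    echo "$mdbfile"
    # jetread is expected in the parent folder of $srcFolder;
    # quoting keeps file names with spaces intact
    ../jetread "$mdbfile" export -fmt json > "../$trgFolder/$mdbfile.json"
done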

JSON > mARkdown:

A Python script (_05_to_OpenITI_mARkdown.py) is then used to build the text files from the tables in the JSON file:

  • first, the entire dictionary is deNoised (i.e., short vowels, kashīdas, etc. are removed)

  • the script loops through all the rows in the bxxxx table and builds a newBook dictionary (key: page id; value: page content). For each row, it:

    • gets the page and volume numbers of the page
    • gets the text content of the page
    • removes the footnotes (which are separated from the main text by a number of underscores ________) using the regex \n_+\n
  • the script then loops through the rows in the txxxx table and maps the table of contents to the pages in the newBook dictionary:

    • using the tit (title text) and lvl (title level) fields from the txxxx table, it prefixes the titles in the text (in the newBook dictionary) with the relevant mARkdown tag (### + a number of pipes (|)) using a re.sub() operation

    NB: each inserted tag is followed by an AUTO tag to show that the title tag was inserted automatically; this AUTO tag should later be removed in a manual check. If the automatic insertion fails (e.g., because the title is not mentioned in the text), the title is added at the top of the page, followed by a CHECK tag.

  • the script then extracts all metadata from the main table and converts it to a string (each line prepended with a #META# tag)

  • the text of the book is built by joining the pages in the newBook dictionary

  • paragraph tags (#) are prepended to every new line in the text; to make the files more readable in a reader like EditPadPro, line breaks (followed by ~~) are then introduced after every ca. 72 characters

  • finally, the file is assembled by concatenating the magicValue ######OpenITI#, the metadata, and the text, and is saved to the mARkdown folder
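
The text-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the code of _05_to_OpenITI_mARkdown.py: the function names, the exact set of "noise" characters, and the precise layout of the AUTO and CHECK tags are assumptions. Only the footnote regex \n_+\n, the ### + pipes header tag, the # paragraph tag, and the ~~ line continuation come from the description.

```python
import re
import textwrap

# Assumption: "noise" means the Arabic short-vowel marks (U+064B-U+0652)
# plus the kashida/tatweel (U+0640); the real script may strip more.
NOISE = re.compile("[\u064B-\u0652\u0640]")
FOOTNOTE_SEP = re.compile(r"\n_+\n")  # regex quoted in the description


def de_noise(text):
    """Remove short vowels and kashidas from a string."""
    return NOISE.sub("", text)


def remove_footnotes(page):
    """Keep only the text before the underscore separator line."""
    return FOOTNOTE_SEP.split(page, maxsplit=1)[0]


def tag_title(page, title, level):
    """Prefix the first occurrence of `title` with a mARkdown header tag.

    If the title is not found in the page, it is put at the top of the
    page with a CHECK tag instead (tag layout is an assumption).
    """
    tag = "### " + "|" * level
    tagged, count = re.subn(
        re.escape(title),
        lambda m: f"\n{tag} AUTO {m.group(0)}\n",
        page,
        count=1,
    )
    if count:
        return tagged
    return f"{tag} CHECK {title}\n{page}"


def wrap_paragraphs(text, width=72):
    """Prefix each text line with '# ' and break long lines with '~~'."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("### "):
            out.append(line)  # leave header lines untouched
        else:
            out.append("# " + "\n~~".join(textwrap.wrap(line, width)))
    return "\n".join(out)
```

A page would pass through these helpers in the order described: de_noise, remove_footnotes, tag_title for each title mapped to that page, and finally wrap_paragraphs on the joined text.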

Windows

A Python script is used to convert the bok files directly:

  • the database in the bok files is accessed using the pypyodbc library
  • the metadata is extracted from the main table
  • the script loops through the bxxxx table and builds the text string:
    • the text fragment is deNoised
    • footnotes are removed and appended to an endnotes string, with the page number
    • page numbers are added to the text fragment
    • structural markup is added to the text fragment by a re.sub operation using the tit and lvl fields of the txxxx table
    • the text fragment is added to the text string
  • the metadata and text are then joined and saved to file

NB: the pypyodbc library has trouble converting the Windows-1256 encoded strings in the bok file to utf-8; to work around this encoding problem, the text and endnotes strings are written to a temp file, which is then converted to utf-8.
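
Two of the steps above, moving footnotes into an endnotes string and converting the temp file to utf-8, might look roughly like this. It is a hedged sketch rather than the actual Windows script: the function names and the "[p. N]" endnote format are hypothetical, and it assumes that the temp file contains raw Windows-1256 bytes (Python codec name cp1256) and that footnotes are separated by the same underscore line as in the MacOS workflow.

```python
import re
from pathlib import Path

FOOTNOTE_SEP = re.compile(r"\n_+\n")  # underscore line before the footnotes


def split_footnotes(fragment, page_no):
    """Return (main_text, endnote) for one page fragment.

    The endnote carries the page number so it can be traced back later;
    the "[p. N]" layout is an assumption.
    """
    parts = FOOTNOTE_SEP.split(fragment, maxsplit=1)
    if len(parts) == 1:
        return fragment, ""  # no footnotes on this page
    return parts[0], f"[p. {page_no}] {parts[1].strip()}\n"


def reencode_to_utf8(src, dst, source_encoding="cp1256"):
    """Read a Windows-1256 encoded file and save it again as UTF-8."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

Each endnote returned by split_footnotes would be appended to the running endnotes string, and reencode_to_utf8 would be called once on the temp file after all fragments have been written.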
