Christopher Hall edited this page Oct 15, 2010 · 9 revisions

Command sequence to build an image

Building an image starts with obtaining a wiki XML dump file and uncompressing it if necessary. To start, try using one of the sample files. Three source XML files are normally required by this process:

  1. An XML file containing just one article describing the license for the data.

    • The file for Wikipedia data is in: XML-Licenses/en/license.xml
  2. An XML file containing just one article describing the terms of use for the data.

    • The file for Wikipedia data is in: XML-Licenses/en/terms.xml
  3. The XML file containing all the articles required.

    • There is a sample in: xml-file-samples/classical_composers.xml
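Before starting a long build it can help to confirm that the three input files are actually present. The following is a minimal sketch, assuming it is run from the top of the checked-out wikireader directory; check_xml_inputs is a hypothetical helper and is not part of the repository.

```shell
# Hypothetical helper: report any XML input file that is missing or empty.
# Prints each problem path to stderr and returns non-zero if one was found.
check_xml_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call using the paths listed above:
# check_xml_inputs XML-Licenses/en/license.xml XML-Licenses/en/terms.xml \
#     xml-file-samples/classical_composers.xml
```

The function only tests that each file exists and is non-empty; it does not validate the XML itself.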

These data files are now processed through four steps to produce the set of data and index files:

  1. Index to determine all article titles, detect redirected articles and build the search index files (.FND, .PFX).

  2. Parse to convert the individual article XML data to HTML using a slightly modified version of the MediaWiki PHP code.

  3. Render the HTML data to produce the rendered data (.DAT) and offset files (.idx-tmp). The offset files do not end up on the SD Card.

  4. Convert the offset files to an article index file (.IDX).

Here is the command to perform these operations:

$ make XML_FILES='XML-Licenses/en/license.xml XML-Licenses/en/terms.xml xml-file-samples/classical_composers.xml'\
    WORKDIR=work DESTDIR=image WIKI_LANGUAGE=en cleandirs createdirs iprc

Note that iprc in the above command is shorthand for the four targets index parse render combine.
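Going by the note above, the same build can be written with the four targets spelled out, which is handy when only some of the steps need to be rerun. The following is a minimal sketch; image_build_command is a hypothetical helper, not one of the wikireader scripts, and it only prints the command so that it can be reviewed before being run.

```shell
# Hypothetical helper: print the make command line with the four targets
# written out in place of the iprc shorthand.
# Usage: image_build_command LANGUAGE XML_FILE...
image_build_command() {
    lang=$1
    shift
    # "$*" joins the remaining arguments (the XML files) with single spaces.
    echo "make XML_FILES='$*' WORKDIR=work DESTDIR=image WIKI_LANGUAGE=$lang cleandirs createdirs index parse render combine"
}

# Example:
# image_build_command en XML-Licenses/en/license.xml XML-Licenses/en/terms.xml \
#     xml-file-samples/classical_composers.xml
```

WORKDIR=work and DESTDIR=image are simply the values used in the command above; file names containing spaces or quotes are not handled by this sketch.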

To complete the image it is necessary to add the program, fonts and other items using the command:

$ make DESTDIR=image install

To check that this image works the simulator can be run using:

$ make DESTDIR=image sim4

After trying the simulator, clear the history before transferring the data to the SD Card. Note that the second command below erases the existing SD Card contents:

$ make DESTDIR=image clear-history
$ rm -r /media/MY-CARD
$ ./scripts/MakeSD --image=image --card=/media/MY-CARD en

The MakeSD script accepts either a two-letter language code such as fr, meaning frpedia, or the full subdirectory name such as enquote. This script is just a convenience for when the image directory has more items than would fit onto a single card.
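Because the rm -r step is destructive, it may be worth checking the card path before running it. The following is a minimal sketch of such a check; is_safe_card_path is a hypothetical helper and /media/MY-CARD is just the example path used above.

```shell
# Hypothetical guard: succeed only if the argument is non-empty, names an
# existing directory, and does not resolve to the root directory.
is_safe_card_path() {
    card=$1
    [ -n "$card" ] || return 1
    [ -d "$card" ] || return 1
    # Resolve symlinks in a subshell so the caller's directory is unchanged.
    case $(cd "$card" && pwd -P) in
        /) return 1 ;;
    esac
    return 0
}

# Example:
# is_safe_card_path /media/MY-CARD && rm -r /media/MY-CARD
```

This does not confirm that the path is really the SD Card; it only catches an empty variable, a card that is not present, or a path that resolves to /.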

Building with multiple machines

The above process will produce just one large data file, which can exceed the maximum file size for the FAT formatted SD Card; also the time taken to run the process could be days or weeks depending on the CPU and disk speed of the machine.
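To see whether a finished image has this problem, the image directory can be scanned for files that are too large. The following is a minimal sketch, assuming the card is formatted as FAT32, where the largest possible file is 4 GiB minus one byte; find_oversize_files is a hypothetical helper.

```shell
# Largest file size FAT32 can store: 4 GiB minus one byte.
FAT32_MAX_BYTES=4294967295

# Hypothetical helper: list files under a directory that are larger than a
# byte limit. The limit defaults to the FAT32 maximum.
# Usage: find_oversize_files DIRECTORY [LIMIT_BYTES]
find_oversize_files() {
    dir=$1
    limit=${2:-$FAT32_MAX_BYTES}
    # With the "c" suffix find counts in bytes; "+" means strictly more than.
    find "$dir" -type f -size +"${limit}c"
}

# Example:
# find_oversize_files image
```

An empty result means every file fits within the limit.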

The Makefile has options to use multiple machines and multiple cores or threads, but this is awkward to use so there are some scripts to simplify this process. These are:

  • scripts/Run the primary script that runs on each machine. The parameters to this script identify how many machines are in the set, the number of threads available and which machine of the set it is. The remainder of the command specifies which languages to process.

  • scripts/progress an interface script to allow starting, stopping and monitoring of the set of running servers

  • scripts/RenderDo allows running a command on one of the servers. This assumes that ssh public key access has been set up and that a screen process is running on each server.

  • scripts/StartScreen used by the progress script to start a screen process on the servers.

  • scripts/Copy used by the progress copy function to ensure that one of the servers has the complete set of files so that the combine operation as described in the previous section can be performed.

  • scripts/AnalyseLog this is run on each machine by progress to extract summary information from log files created by the running servers. The log files are confusing to read since there are many parallel makes running on each server, all writing to a single log file. This script formats this in a readable form and gives an estimated completion time.

Notes:

The Run script makes some assumptions about the location and naming of the data files. It assumes that XML-Licenses/${language} will contain license.xml and terms.xml and that for each language there is a symlink to the actual data file,

e.g. for German Wikipedia:

dewiki-pages-articles.xml -> ../dewiki-20100925-pages-articles.xml

Note that the date in the actual data file name is extracted and used to create WIKI.FTR.

In this case the "current directory" is the checked out wikireader directory.
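The date extraction can be pictured as reading the symlink target and picking out the 8-digit field. The following is a minimal sketch of that idea and not the code the Run script actually uses; dump_date is a hypothetical helper and it assumes file names of the form shown above (lowercase wiki name, date, then the rest).

```shell
# Hypothetical helper: print the 8-digit date embedded in the name of the
# file that a dump symlink points at.
# Usage: dump_date SYMLINK
dump_date() {
    target=$(readlink "$1") || return 1
    # Keep only the date field from e.g. dewiki-20100925-pages-articles.xml
    basename "$target" | sed -n 's/^[a-z]*-\([0-9]\{8\}\)-.*$/\1/p'
}

# Example:
# dump_date dewiki-pages-articles.xml
```

If the target name does not match the assumed pattern, nothing is printed.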

Configuration of the servers

Each server has the following characteristics:

  • hostnames that differ only in their numeric suffix e.g. render1, render2, ...
  • the same user (currently wr) with ssh public key access from a separate control machine.
  • one of the machines designated as the combine_host; this accumulates a full set of files so may need more disk space.
  • ssh public key access from all machines to the combine_host
  • each machine installed with Ubuntu server plus all the relevant packages i.e. make requirements must not produce missing packages errors.
  • each having a copy of the XML files and the symlinks installed (e.g. progress -Len -x -Lja -x -Len:quote -x)
  • screen is running (progress -N)

The configuration is in the progress script in a function called set_farm(). Here is one of the configurations:

    combine_host="3"                           # suffix number of the host to do the combine process
    render_host='--host=simul3'                # the "maximum host"  i.e. specifies simul1, simul2 and simul3
    run_parallel="--machines=3 --parallel=12"  # there are three machines and each machine has 12 threads (actually 6 core + HT)

Some commands

$ progress -p                              # make sure nothing is running by looking at process list
$ progress -r fr,de,en:quote               # queue up frpedia depedia and en-quotes
$ progress -Lfr -a                         # check progress
$ progress -Lde -a -Len:quote -a           # check progress
$ progress -p                              # make sure nothing is running by looking at process list
$ progress -Lfr -c -G -Lde -c -G           # combine and retrieve some files
$ make WIKI_LANGUAGE=de install            # add fonts etc.
$ make WIKI_LANGUAGE=de sim4               # run simulator to test