Extract roster and logbook data from an AIMS roster
test-data
.gitignore
approx_night.py
aws_lambda_processroster.py
extract_freeform.py
extract_roster.py
gpl.txt
process_roster.py
process_roster_27bp.py
readme.md
test_process_roster.py
unittest_results.py

Roster Processor

This program is used to take an AIMS roster as input and produce various useful formats as output.

process_roster.py

This module is the core of the program. It processes the AIMS roster through a number of data structure changes.

The AIMS roster is currently delivered as HTML 4.01 Transitional and fails validation. The format is not well defined and is subject to change without notice, so this program is necessarily somewhat brittle and will need updating whenever the underlying roster format changes.

### Converting HTML into “lines”

This step is handled by lines(...) in conjunction with the RosterParser class. The RosterParser class subclasses HTMLParser from the Python standard library, which is designed to handle potentially badly formed HTML. Parsing provided by this class is necessarily extremely basic.

The input to lines(...) is the HTML AIMS roster as a Python string. Currently the roster consists of one <table> per page, containing one <tr> for each line on the page. Each <tr> is broken down into a large number of <td> elements to form a grid: the width of each <td> is fixed in the top <tr> of the page, and columns of the required width are formed with colspan attributes. Notably, empty rows are formed from a full complement of completely empty <td> elements, and HTMLParser does not call handle_data(...) for completely empty elements.

The output, which I shall refer to as “lines”, is a list of lists of the form:

[
    [line0cell0, line0cell1, ...],
    [line1cell0, line1cell1, ...],
    ...
]

where, for example, line10cell5 would represent the 6th non-empty <td> element from the 11th <tr> element in the HTML string. The division into pages is not captured, as it does not appear to be useful or relevant for our purposes.
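To make the “lines” structure concrete, here is a minimal sketch of an HTMLParser subclass that builds one list per <tr>. It is not the real RosterParser: the names `LineParser` and `sketch_lines` are hypothetical, and it assumes that only text found inside a <td> is of interest.

```python
from html.parser import HTMLParser


class LineParser(HTMLParser):
    """Hypothetical stand-in for RosterParser: one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.lines.append([])  # every <tr> starts a new line
        elif tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Not called at all for completely empty elements, so empty
        # cells never appear and an empty row yields an empty list.
        # Whitespace-only cells are skipped too (an assumption).
        text = data.strip()
        if self.in_td and self.lines and text:
            self.lines[-1].append(text)


def sketch_lines(html):
    """Return the "lines" list of lists for an HTML string."""
    parser = LineParser()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Because empty cells are never reported, a cell's index within a line counts only non-empty cells, which is why the position of a cell in “lines” does not directly give its grid column.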

### Converting “lines” into “columns”

Most of the pertinent information is contained in the block of rows on the first page that appears as a table when the HTML is viewed. As far as I can tell, this table always starts on row 5 of the first page's table, so the number of columns we require is equal to the number of cells in the 5th line.

The end of the pertinent rows is marked by the word “Block” appearing as one of the items in a line.

The function columns(...) takes the “lines” format and converts it to “columns” format by identifying the pertinent lists and rearranging them into the form:

[
    [col0cell0, col0cell1, ...],
    [col1cell0, col1cell1, ...],
    ...
    [col31cell0, col31cell1, ...]
]

where, for example, col5cell6 is the 7th entry down in the 6th column from the left of the pertinent table.
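The rearrangement amounts to selecting the pertinent lines and transposing them. The sketch below illustrates the idea only; it is not the actual columns(...) implementation. The name `sketch_columns` and its parameters are hypothetical, it assumes the 5th line sits at index 4, and it pads short lines with empty strings, which the real code may handle quite differently.

```python
def sketch_columns(lines, start=4, end_marker="Block"):
    """Transpose the pertinent lines into a list of columns."""
    pertinent = []
    for line in lines[start:]:
        if end_marker in line:  # "Block" marks the end of the table
            break
        pertinent.append(line)
    if not pertinent:
        return []
    # The first pertinent line fixes the number of columns.
    width = len(pertinent[0])
    # Assumption: lines shorter than the first one are padded with "".
    padded = [line + [""] * (width - len(line)) for line in pertinent]
    return [[row[i] for row in padded] for i in range(width)]
```

With four header lines followed by `["1", "2"]`, `["a", "b"]`, `["x"]` and a line containing “Block”, this yields `[["1", "a", "x"], ["2", "b", ""]]`.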

### Converting “columns” into an “event stream”

The next step is to convert the “columns” into a single consistent stream of identified, datestamped objects.

event_stream(...) processes the “columns” format by working down each column in turn.

When a data item is identified as a time, it is combined with the date that the column represents and pushed to the stream as a standard datetime object.

If it is identified as a pertinent string, such as a flight number or airport code, it is pushed to the stream as an Event object, which is a combination of a date object and a string. This makes Event objects polymorphic with datetime objects on the date attribute.

If a blank line is found, it is pushed to the stream as a Break object with the type attribute LINE. At the end of each column, a Break object with the type attribute COLUMN is pushed.

Various non-pertinent strings are ignored if found.

Note that objects in the “event stream”, with the exception of ignored strings, have a one-to-one relationship with entries in the roster. A typical two-sector duty in this stream will therefore look something like:

[ ..., BREAK(column), EVENT(flight#), DATETIME(duty start), DATETIME(off blocks),
 EVENT(departure airport), EVENT(arrival airport), DATETIME(on blocks),
 BREAK(line), EVENT(flight#), DATETIME(off blocks), EVENT(departure airport),
 EVENT(arrival airport), DATETIME(on blocks), DATETIME(off duty), BREAK(column), ...]

There will always be a Break object with type attribute COLUMN as the first and last items of an “event stream”.
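The walk down each column can be sketched as follows. This is an illustration under several assumptions rather than the real event_stream(...): times are assumed to look like `HH:MM`, a blank line is assumed to be an empty string, each column is assumed to arrive already paired with its date, and the filtering of non-pertinent strings is not modelled. `Event`, `Break`, `LINE`, `COLUMN` and `sketch_event_stream` are hypothetical stand-ins.

```python
import datetime as dt
import re
from collections import namedtuple

Event = namedtuple("Event", "date text")  # a date plus a string
Break = namedtuple("Break", "type")
LINE, COLUMN = "line", "column"
TIME = re.compile(r"^(\d{1,2}):(\d{2})$")  # assumed time format


def sketch_event_stream(dated_columns):
    """dated_columns: iterable of (date, [cell strings]) pairs."""
    stream = [Break(COLUMN)]  # the stream always opens with a column break
    for date, cells in dated_columns:
        for cell in cells:
            match = TIME.match(cell)
            if match:
                # A time is combined with the date of its column.
                time = dt.time(int(match.group(1)), int(match.group(2)))
                stream.append(dt.datetime.combine(date, time))
            elif cell == "":
                stream.append(Break(LINE))
            else:
                stream.append(Event(date, cell))
        stream.append(Break(COLUMN))  # pushed at the end of each column
    return stream
```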

### Converting the “event stream” into a “duty stream”

The entries in the “event stream” now need to be broken up into duties.

Unfortunately, column breaks are ambiguous: sometimes they represent a gap between duties, sometimes a gap between sectors, and sometimes no gap at all. To resolve the ambiguity, we need to identify each column break from its context by considering up to two entries on either side of it, which is a somewhat messy process. The rules are:

  1. DATETIME, BREAK(column), EVENT ---> change to BREAK(line)
  2. EVENT, BREAK(column), DATETIME ---> should not occur
  3. DATETIME, BREAK(column), DATETIME ---> remove
  4. EVENT, BREAK(column), EVENT ---> ambiguous so:
    • BREAK(any), EVENT, BREAK(column), EVENT ---> change to BREAK(line)
    • DATETIME, EVENT, BREAK(column), EVENT ---> remove
    • EVENT, EVENT, BREAK(column), EVENT ---> should not occur
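The rules above can be sketched as a single pass over the stream. As listed, they look at up to two entries before the column break and one after it. The names below (`Event`, `Break`, `kind`, `resolve_column_breaks`) are hypothetical. Cases the rules do not mention, such as the column breaks at either end of the stream or a column break next to another break, are passed through unchanged here, which is an assumption; the “should not occur” cases raise an error.

```python
import datetime as dt
from collections import namedtuple

Event = namedtuple("Event", "date text")
Break = namedtuple("Break", "type")
LINE, COLUMN = "line", "column"


def kind(item):
    """Classify an entry as "B"reak, "D"atetime or "E"vent."""
    if isinstance(item, Break):
        return "B"
    return "D" if isinstance(item, dt.datetime) else "E"


def resolve_column_breaks(stream):
    out = []
    for i, item in enumerate(stream):
        is_column = isinstance(item, Break) and item.type == COLUMN
        if not is_column or i == 0 or i == len(stream) - 1:
            out.append(item)
            continue
        prev, nxt = kind(stream[i - 1]), kind(stream[i + 1])
        if "B" in (prev, nxt):
            out.append(item)  # not covered by the rules: left as is
        elif prev == "D" and nxt == "E":
            out.append(Break(LINE))  # rule 1
        elif prev == "E" and nxt == "D":
            raise ValueError("EVENT, BREAK(column), DATETIME")  # rule 2
        elif prev == "D" and nxt == "D":
            pass  # rule 3: remove the break
        else:  # rule 4: EVENT on both sides, so look one entry further back
            # Start of stream is treated like a break (assumption).
            before = kind(stream[i - 2]) if i >= 2 else "B"
            if before == "B":
                out.append(Break(LINE))
            elif before == "E":
                raise ValueError("EVENT, EVENT, BREAK(column), EVENT")
            # before == "D": remove the break
    return out
```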

Once all the column breaks have been either removed or changed to line breaks, the task is to determine which line breaks separate duty blocks and which separate sectors or standbys within a duty block.

We can first tackle all-day duties:

BREAK(line), EVENT, BREAK(line) ---> BREAK(duty), EVENT, BREAK(duty)

With that done, all remaining line breaks should be of the form:

DATETIME, BREAK(line), EVENT, DATETIME

If those two DATETIME objects are more than 8 hours apart, we can safely assume that the line break is in fact a duty break and replace it. All remaining line breaks then represent breaks between items within a duty block.
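A rough sketch of these two passes is shown below, assuming the stream by this point contains only line breaks between entries. The names `Event`, `Break`, `is_line_break` and `mark_duty_breaks` are hypothetical stand-ins, not the identifiers used in process_roster.py.

```python
import datetime as dt
from collections import namedtuple

Event = namedtuple("Event", "date text")
Break = namedtuple("Break", "type")
LINE, DUTY = "line", "duty"


def is_line_break(item):
    return isinstance(item, Break) and item.type == LINE


def mark_duty_breaks(stream, gap=dt.timedelta(hours=8)):
    out = list(stream)
    # Pass 1, all-day duties: BREAK(line), EVENT, BREAK(line).
    # Checks read the original stream so adjacent all-day duties still match.
    for i in range(1, len(stream) - 1):
        if (isinstance(stream[i], Event)
                and is_line_break(stream[i - 1])
                and is_line_break(stream[i + 1])):
            out[i - 1] = out[i + 1] = Break(DUTY)
    # Pass 2: DATETIME, BREAK(line), EVENT, DATETIME with a long gap.
    for i in range(1, len(out) - 2):
        before, after = out[i - 1], out[i + 2]
        if (is_line_break(out[i])
                and isinstance(before, dt.datetime)
                and isinstance(out[i + 1], Event)
                and isinstance(after, dt.datetime)
                and after - before > gap):
            out[i] = Break(DUTY)
    return out
```

Any line break that survives both passes is then a break between items within a duty block.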

The duty_stream(...) function takes the “event stream” and carries out all this processing on it. It then breaks up the event stream at the BREAK(duty) entries to give a “duty stream” output. For our two-sector duty block it will look like:

[ ... , [ EVENT(flight#), DATETIME(duty start), DATETIME(off blocks),
 EVENT(departure airport), EVENT(arrival airport), DATETIME(on blocks),
 BREAK(line), EVENT(flight#), DATETIME(off blocks), EVENT(departure airport),
 EVENT(arrival airport), DATETIME(on blocks), DATETIME(off duty)],
 ... ]
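The final split at the BREAK(duty) entries might look like the sketch below. `Break`, `DUTY` and `split_at_duty_breaks` are hypothetical names, and dropping empty groups (for example between two adjacent duty breaks) is an assumption.

```python
from collections import namedtuple

Break = namedtuple("Break", "type")
LINE, DUTY = "line", "duty"


def split_at_duty_breaks(stream):
    """Group stream entries into duties, discarding the duty breaks."""
    duties, current = [], []
    for item in stream:
        if isinstance(item, Break) and item.type == DUTY:
            if current:  # assumption: empty groups are dropped
                duties.append(current)
            current = []
        else:
            current.append(item)  # line breaks stay inside the duty
    if current:
        duties.append(current)
    return duties
```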

### Converting “duty stream” to “duty list”

The duty_list(...) function carries out the conversion to the final output format.

Firstly, a duty that spans midnight on the first day of the roster will result in orphaned entries in the first column. Similarly, a duty that spans midnight on the last day of the roster will only be partially shown. Where possible, these need to be fixed up with fake data in order to process the maximum amount of available information.

The “duty stream” can then be converted to the “duty list” format, which is of the following form:

[
    [ DATE, [DUTYSTART, DUTYEND], [ITEM], [ITEM], ... ],
    [ DATE, [DUTYSTART, DUTYEND], [ITEM], [ITEM], ... ],
    ...
]

where [ITEM] is one of:

  1. A SECTOR of the form [EVENT(flight#), DATETIME(offblocks), EVENT(departure airfield), EVENT(arrival airfield), DATETIME(onblocks)]

  2. A STANDBYLIKE of the form [EVENT(type), DATETIME(start), DATETIME(end)]

  3. An all-day duty of the form [EVENT(type)]. In this case DUTYSTART and DUTYEND will both be None.
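As an illustration of the target shape, the sketch below converts one duty from the “duty stream” into one “duty list” entry. It is one plausible reading, not the actual duty_list(...) code. It assumes, following the two-sector example earlier, that the duty start is the second entry and the off-duty time is the last entry, that DATE is the date of the duty start (or of the event for an all-day duty), and it skips the fix-up of partial duties entirely. All names are hypothetical.

```python
import datetime as dt
from collections import namedtuple

Event = namedtuple("Event", "date text")
Break = namedtuple("Break", "type")


def sketch_duty_entry(duty):
    """Turn one duty (a list of stream entries) into a duty list entry."""
    if len(duty) == 1:
        # All-day duty: a lone Event, with no start or end time.
        return [duty[0].date, [None, None], [duty[0]]]
    entries = list(duty)
    start = entries.pop(1)  # assumption: duty start follows the first event
    end = entries.pop()     # assumption: off-duty time comes last
    items, current = [], []
    for entry in entries:
        if isinstance(entry, Break):  # a line break separates items
            items.append(current)
            current = []
        else:
            current.append(entry)
    items.append(current)
    return [start.date(), [start, end]] + items
```

With the two-sector duty from the earlier example, each resulting item has the five-entry SECTOR shape; a three-entry group would correspond to a STANDBYLIKE.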