
6x Presentation


Glyph Segmentation

One of the major tasks of an OMR engine is the recognition of musical symbols: which kind they are, where they are located, what their precise dimensions are, etc.

Within the whole population of musical symbols, we make an important distinction between varying-shape and fixed-shape symbols, because their recognition techniques will differ:

  • A varying-shape symbol, such as a beam, slur, stem, or text item, has no fixed shape or size.
    Each of these kinds requires ad-hoc strategies and techniques, so each is handled by a separate, specific step in the Audiveris engine.
  • A fixed-shape symbol, such as a clef, note head, rest, accidental sign, or flag, once normalized with respect to the scale of the underlying staff, is rather constant in shape and size.
    The existence of different musical fonts can of course bring some variation but, with proper training, this generally remains manageable.
    There are about 120 different fixed-shape symbols, so it is crucial to find an efficient way to address this whole population.

Audiveris 5.x still uses, inherited from the 4.x versions, what is called a "glyph classifier". Such a classifier tries to infer the corresponding musical symbol, if any, from a glyph input (a glyph being a set of foreground pixels likely to represent a musical symbol).
The current glyph classifier, when fed with a glyph, computes a vector of about 110 features composed of geometric moments and ART moments (Angular Radial Transform, as used in MPEG-7). A shallow neural network (with just one hidden layer!) then maps these 110 features to the 120 musical shapes rather efficiently.
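
As an illustration, here is a minimal sketch of such a shallow network: one sigmoid hidden layer followed by a softmax over the shape classes. The class and method names are illustrative and this is not the actual Audiveris implementation; the real network's weights of course come from training on labeled glyphs.

```java
// Minimal sketch of the shallow-network idea: a feature vector of ~110 moments
// is mapped to ~120 shape probabilities through a single hidden layer.
public class ShallowGlyphClassifier {
    private final double[][] w1; // hidden weights: [hiddenSize][featureCount]
    private final double[] b1;   // hidden biases
    private final double[][] w2; // output weights: [shapeCount][hiddenSize]
    private final double[] b2;   // output biases

    public ShallowGlyphClassifier(double[][] w1, double[] b1, double[][] w2, double[] b2) {
        this.w1 = w1; this.b1 = b1; this.w2 = w2; this.b2 = b2;
    }

    /** Returns one probability per shape for the given feature vector. */
    public double[] classify(double[] features) {
        double[] hidden = layer(features, w1, b1, true); // sigmoid hidden layer
        double[] logits = layer(hidden, w2, b2, false);  // linear output layer
        return softmax(logits);
    }

    private static double[] layer(double[] in, double[][] w, double[] b, boolean sigmoid) {
        double[] out = new double[b.length];
        for (int i = 0; i < out.length; i++) {
            double sum = b[i];
            for (int j = 0; j < in.length; j++) {
                sum += w[i][j] * in[j];
            }
            out[i] = sigmoid ? 1.0 / (1.0 + Math.exp(-sum)) : sum;
        }
        return out;
    }

    private static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logits) max = Math.max(max, v);
        double sum = 0;
        double[] p = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            p[i] = Math.exp(logits[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) p[i] /= sum;
        return p;
    }
}
```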

The main drawback of a glyph classifier is its key dependency on glyph segmentation, which can be defined as the process of choosing which pixels should compose a glyph. On the one hand, symbols often overlap staff lines or other symbols; on the other hand, poor-quality scores often break symbols into pieces.

  • Staff line retrieval is rather easy, but staff line removal is not. A staff-line pixel may well also belong to a crossing symbol (which has not been retrieved yet, a "chicken & egg" situation typical of OMR). So staff line removal often degrades glyph candidates.
  • In 5.x, to better cope with the compact blobs of note heads, these specific head shapes were removed from the glyph classifier's scope and processed via template matching instead. But template matching brought its own problems, beginning with its dependency on a particular font.
  • Conversely, the building of compound glyphs from smaller pieces quickly hits combinatorial complexity: a set of just 10 pieces in a neighborhood leads to 2**10 (1024) combinations to test, as the sketch after this list illustrates.
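
As a rough illustration of this combinatorial explosion, the following sketch (illustrative names, not taken from the Audiveris code base) enumerates every possible combination of n neighboring pieces:

```java
import java.util.ArrayList;
import java.util.List;

// Building compound glyphs means potentially testing every subset of the
// neighboring pieces, and the number of subsets grows as 2^n.
public class CompoundEnumerator {
    /** Returns every non-empty subset of piece indices 0..n-1 (2^n - 1 subsets). */
    public static List<int[]> allCombinations(int n) {
        List<int[]> subsets = new ArrayList<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            int[] subset = new int[Integer.bitCount(mask)];
            int k = 0;
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    subset[k++] = i;
                }
            }
            subsets.add(subset);
        }
        return subsets;
    }

    public static void main(String[] args) {
        // 10 pieces -> 1023 non-empty combinations (1024 counting the empty set)
        System.out.println(allCombinations(10).size()); // prints 1023
    }
}
```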

In short, segmentation is an endless fight.

Location Context

Trying to recognize a symbol in isolation may not be the right approach, especially when facing poor quality scores.

Even a human reader, consciously or not, takes the "symbol neighborhood" into account to decipher a given area in the score image.

Moreover, even if a symbol's shape is clear from a purely graphical point of view, its precise interpretation as a musical symbol may depend on the location context.
Let's consider the (extreme) case of a dot. Graphically, a dot is a dot, but as a musical symbol it could be:

  • a part of a repeat sign (upper or lower dot), close to a barline on its left or right side,
  • a staccato sign, close to a note head above or below,
  • an augmentation dot (first or second dot), close to a note head or rest (or to a first dot) on its left side,
  • a part of a fermata sign, close to a fermata arc just above or below,
  • a dot of an ending indication, close to an ending text or number on its left side,
  • a simple text dot,
  • or just some stain...

Audiveris 5.x tries to take this contextual information into account. The notion of a SIG (Symbol Interpretation Graph) formalizes potential relations between a candidate interpretation and the other surrounding interpretations, resulting in the computation of a contextual grade value for the candidate. Note, here again, the "chicken & egg" situation between all candidates.
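
The following sketch gives the flavor of the SIG approach. The names and the grading formula are purely illustrative; the actual Audiveris computation differs in its details.

```java
import java.util.ArrayList;
import java.util.List;

// Each candidate interpretation carries an intrinsic grade; supporting
// relations from surrounding interpretations raise it to a contextual grade.
public class SigSketch {
    static class Inter {
        final String shape;  // e.g. "augmentationDot"
        final double grade;  // intrinsic grade in 0..1, from the classifier
        final List<Support> supports = new ArrayList<>();

        Inter(String shape, double grade) {
            this.shape = shape;
            this.grade = grade;
        }

        /** Illustrative formula: each support closes part of the gap to 1.0. */
        double contextualGrade() {
            double cg = grade;
            for (Support s : supports) {
                cg += (1.0 - cg) * s.ratio * s.partner.grade;
            }
            return cg;
        }
    }

    static class Support {
        final Inter partner; // the surrounding interpretation that supports us
        final double ratio;  // strength of the relation, in 0..1

        Support(Inter partner, double ratio) {
            this.partner = partner;
            this.ratio = ratio;
        }
    }

    public static void main(String[] args) {
        Inter head = new Inter("noteheadBlack", 0.8);
        Inter dot = new Inter("augmentationDot", 0.5);
        dot.supports.add(new Support(head, 0.6)); // dot lies right of a note head
        System.out.printf("contextual grade: %.2f%n", dot.contextualGrade()); // ~0.74
    }
}
```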

The main drawback of this analytical approach is the resulting code complexity, since each of these musical "rules" must be explicitly coded and tuned by hand. This must represent a significant part of the 200,000 lines of code in Audiveris 5.1.

New Classifiers

Recent results in machine learning have shown that a deep neural network, when properly trained on very large representative data sets, can beat former sophisticated software on a wide variety of tasks.

Back in summer 2016, we were beginning to play with Deeplearning4J when ZHAW (Zurich University of Applied Sciences) got in touch with us. They proposed to use Audiveris as a platform for experiments using Deep Learning for recognition and classification.

The 6.0 prototype, with its two kinds of classifiers, derives directly from this ZHAW/Audiveris collaboration.

Patch Classifier

We assume the input image has been scaled to an interline of 10 pixels, meaning that the vertical distance from one staff line to the next is 10 pixels. The chosen value is somewhat arbitrary, but it must be the same at training and inference time.
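
As an illustration, a plain-AWT way of performing such a rescaling could look like the sketch below (illustrative code, not the actual Audiveris implementation):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Rescale the input image so that the measured interline becomes 10 pixels.
public class InterlineScaler {
    public static BufferedImage scaleToInterline10(BufferedImage input, double measuredInterline) {
        double ratio = 10.0 / measuredInterline;
        int width = (int) Math.round(input.getWidth() * ratio);
        int height = (int) Math.round(input.getHeight() * ratio);
        BufferedImage output = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = output.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(input, 0, 0, width, height, null);
        g.dispose();
        return output;
    }
}
```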

Given a selected (x, y) location, the classifier input is composed of all the pixels of a sub-image (the "patch"), centered on the provided location.

The patch dimensions have been chosen rather arbitrarily as 40 pixels wide by 160 pixels high. This represents the equivalent of 1 staff height in width and 4 staff heights in height, enough for the neural network to "grab" relevant context data.

The classifier output is a vector of probabilities indexed by symbol shape.
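
The following sketch (illustrative names, not the actual Audiveris code) shows how such a patch could be extracted around the selected location; the flattened patch is then fed to the network, which returns one probability per shape.

```java
// Extract the 40 x 160 sub-image centered on the selected (x, y) location
// in an image already scaled to interline 10.
public class PatchExtractor {
    static final int PATCH_WIDTH = 40;   // ~1 staff height
    static final int PATCH_HEIGHT = 160; // ~4 staff heights

    /**
     * Returns the patch centered on (x, y). Pixels falling outside the
     * image are filled with the background value (0 here).
     */
    public static double[][] extract(double[][] image, int x, int y) {
        int height = image.length;
        int width = image[0].length;
        double[][] patch = new double[PATCH_HEIGHT][PATCH_WIDTH];

        for (int py = 0; py < PATCH_HEIGHT; py++) {
            for (int px = 0; px < PATCH_WIDTH; px++) {
                int ix = x - PATCH_WIDTH / 2 + px;
                int iy = y - PATCH_HEIGHT / 2 + py;
                if (ix >= 0 && ix < width && iy >= 0 && iy < height) {
                    patch[py][px] = image[iy][ix];
                }
            }
        }
        return patch;
    }
}
```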

The picture above presents the patch classifier in action:

  • On the left, you can see the patch that was submitted to the classifier
    (the diagonal red lines are not part of the input; they are just a visual means of precisely indicating the patch center location).
  • On the right, the resulting top shapes are presented by decreasing probability.

Notice that, even though we had selected the center of a sharp "glyph", the classifier was able to assign a very high value to the keySharp shape and a very low value to the accidentalSharp shape. This is a good example of the "context effect".

Page Classifier

The input for the page classifier is the whole image, properly scaled at interline 10.

On this input, the page classifier performs:

  • Detection: find the symbols present in the image,
  • Segmentation: determine which pixels belong to each symbol,
  • Classification: assign each detected symbol to the correct class (shape).

Strictly speaking, the page classifier is thus more than a "simple" classifier.

The output is a collection of "annotations", where an annotation represents a detected symbol, composed of:

  • Symbol shape,
  • Bounding box in image,
  • Confidence (a value in 0..1 range).
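
Expressed as a simple data structure, an annotation could look like the following sketch (class and field names are illustrative; the actual Audiveris types may differ). The page classifier then returns a collection of such objects, e.g. a List<Annotation>, for the whole image.

```java
import java.awt.Rectangle;

// One detected symbol, as produced by the page classifier.
public class Annotation {
    public final String shape;      // detected symbol shape, e.g. "gClef"
    public final Rectangle bounds;  // bounding box in the scaled image
    public final double confidence; // value in the 0..1 range

    public Annotation(String shape, Rectangle bounds, double confidence) {
        this.shape = shape;
        this.bounds = bounds;
        this.confidence = confidence;
    }
}
```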

The picture above presents a few annotations drawn on the input image.

  • The symbol's rectangular bounds are shown in green,
  • The shape name is written in magenta, just above the symbol box,
  • Confidence is not shown because, as of this writing, it is always set to the constant value 1.0.