Clara edited this page Apr 5, 2021 · 61 revisions

Introduction

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.

Getting Started

Setting up

Prerequisites

JDK

  1. Visit the Java SE Downloads page, and download the most recent, platform-appropriate binary for JDK 6.
  2. Install the JDK from the downloaded executable file. Note: If you are working in Windows, you will need to set the PATH environment variable manually.

git

  1. Visit the git page and download the most recent installer.
  2. Run the installer.
  3. If you've never used git before, the Pro Git book is an excellent guide. However, for the impatient, we recommend this section on setup and any git quickstart guide (this one is OK).

Maven

  1. Download the most recent stable version of Maven from the Apache Maven project page.
  2. Run the installer. Additional instructions can be found at the bottom of the project page.

Note: We use a standalone install of Maven because the Eclipse plugin version often runs into annoying build bugs. Pointing the plugin to a standalone installation has proven to be much more reliable.

Eclipse + plugins

  1. Download Eclipse IDE for Java Developers (Indigo or newer) from the Eclipse Download page. (Really, any version that already has the Maven plugin installed will do.)
  2. Install Eclipse by extracting the downloaded package into a directory of your choosing.
  3. See the tips page for instructions on installing plugins.

UIMA SDK

Follow the instructions under "Install UIMA SDK" at the Apache UIMA page. See the Eclipse tips page for instructions on installing the UIMA plugins.

Getting the Code for the Tutorial Project

Run git clone https://github.com/oaqa/oaqa-tutorial.git in your preferred project directory.

Open Eclipse and specify the same project directory as your workspace. Once Eclipse is open, go to the workbench and select File > Import > General/Existing Projects into Workspace. Set the root directory to your workspace directory. The tutorial project should show up in the "Projects" box; make sure its checkbox is checked and click Finish.

(This may also need to contain a section on the basics of Git, project information in a location on GitHub, and possibly information on Maven as well, things like Group Id, Artifact ID, Version. Depends on what method above is used.)

Directory structure

Note: need to create generic archetype for CSE project separate from hw1. Sub-Note: what is hw1?

The directory structure should look like this: Note: the directory structure of what? What is meant by "all your descriptors"?

myproject
 |- pom.xml
 '- src
    '- main
      |- java
      |  '- **/*.java 
      '- resources
         |- mypipeline.yaml /* the entry point for your pipeline */
         |- **/*.yaml /* all your descriptors go into the resources folder */
         '- META-INF
            '- org.uimafit
               '- types.txt

(Optional) Persistence/Database

Definitely a good idea to at least define OAQA database, what it is, how it is used.

(Optional) UI

Writing Extended Configuration Descriptors (ECD)

The extended configuration descriptor (ECD) extends UIMA and uimaFit to balance ease of use and flexibility. To this end, ECD provides three major features: (1) YAML-based descriptors, (2) a driver that supports multiple options for a component specified in the descriptor, and (3) a driver that supports declarative options for a component.

Similar to the collection processing engine (CPE) descriptor in the UIMA SDK, the key element of an ECD is a YAML associative array of three elements:

  • collection-reader
  • the actual pipeline, and
  • the optional post-process pipeline

Similar to the UIMA SDK component descriptors (collection reader, pipeline annotator, and CAS consumer), the building blocks of an ECD are component descriptors. Each one is a YAML associative array that defines:

  • The Java class that implements the component,
  • Parameters and values the component requires.

Basics

YAML Format

YAML Ain't Markup Language (YAML) is a human-readable data serialization format that provides a simple but rich syntax to represent a data structure, like our component descriptors as well as engine descriptors.

One important aspect of YAML is that indentation matters. As in other indentation-sensitive languages such as Python, indentation expresses nesting: the lines indented under a key form the block that belongs to that key. A value is assigned to a key using a colon followed by a space. So for example,

configuration: 
  name: oaqa-tutorial
  author: oaqa

Here the parameters name and author are set to oaqa-tutorial and oaqa respectively, and together they are passed as a block to configuration.
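To make the nesting concrete, the following is a minimal sketch of the associative array that the block above corresponds to, built by hand in plain Java. No YAML parser is involved, and the class name ConfigurationBlock is only an illustrative name, not something provided by ECD.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-built equivalent of the YAML block above; illustrative only.
final class ConfigurationBlock {

  static Map<String, Map<String, String>> build() {
    // The two indented "key: value" lines become entries of an inner map.
    Map<String, String> configuration = new LinkedHashMap<>();
    configuration.put("name", "oaqa-tutorial");
    configuration.put("author", "oaqa");

    // The inner map is the value of the enclosing "configuration" key.
    Map<String, Map<String, String>> root = new LinkedHashMap<>();
    root.put("configuration", configuration);
    return root;
  }

  public static void main(String[] args) {
    // prints {configuration={name=oaqa-tutorial, author=oaqa}}
    System.out.println(build());
  }
}
```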

Components

When writing an ECD for a UIMA annotator, the first line specifies whether it is a subclass of another annotator descriptor, or whether it refers to a UIMA annotator class directly. The first line in an ECD component can contain one of the following:

  • inherit: will look for a resource file within the class-path on the path specified by the dotted syntax a.b.c => a/b/c.yaml. Inherited parameters can be overridden directly in the body of the resource file.

  • class: will look for the specified class within the class-path, and is intended to be used as a shorthand for classes that do not have configurable parameters.

For example, if bar.yaml refers to a concrete class Bar.java that resides in the package bar with some fixed parameter fixed-param set to a, then the YAML descriptor will look like:

# bar.yaml
class: bar.Bar
fixed-param: a

If a descriptor foo.yaml inherits from bar.yaml, the YAML descriptor will look like:

# foo.yaml
inherit: bar
var: [x, y]

![resources][resources]
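The two rules described above (dotted names map to resource paths, and a descriptor's own parameters override the inherited ones) can be sketched in a few lines of plain Java. This is only an illustration under assumptions: InheritSketch, resourcePath, and merge are hypothetical names, and the real ECD driver may resolve and merge descriptors differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: these names are hypothetical, not part of the ECD API.
final class InheritSketch {

  // Dotted syntax a.b.c => a/b/c.yaml, to be looked up on the class-path.
  static String resourcePath(String dotted) {
    return dotted.replace('.', '/') + ".yaml";
  }

  // Start from the parent's parameters, then let the child's entries win.
  // The "inherit" key itself is a directive and is not copied.
  static Map<String, Object> merge(Map<String, Object> parent, Map<String, Object> child) {
    Map<String, Object> merged = new LinkedHashMap<>(parent);
    for (Map.Entry<String, Object> entry : child.entrySet()) {
      if (!entry.getKey().equals("inherit")) {
        merged.put(entry.getKey(), entry.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> bar = new LinkedHashMap<>();
    bar.put("class", "bar.Bar");
    bar.put("fixed-param", "a");

    Map<String, Object> foo = new LinkedHashMap<>();
    foo.put("inherit", "bar");
    foo.put("var", Arrays.asList("x", "y"));

    System.out.println(resourcePath("a.b.c")); // prints a/b/c.yaml
    System.out.println(merge(bar, foo));       // prints {class=bar.Bar, fixed-param=a, var=[x, y]}
  }
}
```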

Parameters

Resources on a descriptor are configured using named parameters; any dash-separated string is a valid parameter name except for the reserved keywords: inherit, options, class and pipeline. The value of a parameter is either a Java primitive wrapper (Integer, Float, Long, Double, Boolean) or a String. For nested resources, compound parameters are passed as Strings that are further parsed within the resource. For example, passing a RegEx pattern as a String parameter to a RoomAnnotator will look like:

class: annotators.RoomAnnotator

pattern: \\b[0-4]\\d[0-2]\\d\\d\\b

Cross-opts:

Combinatorial parameters are specified using the cross-opts mapping and declaring the desired values as elements on a list. For example for the following annotator descriptor:

class: annotators.RoomAnnotator

foo: bar

cross-opts:
    parameter-a: [value100,value200]
    parameter-b: [value300,value400] 

The configuration above will result in the 2x2 cross-product of configurations of the component:

[foo:bar, parameter-a: value100, parameter-b: value300] 
[foo:bar, parameter-a: value200, parameter-b: value300]
[foo:bar, parameter-a: value100, parameter-b: value400] 
[foo:bar, parameter-a: value200, parameter-b: value400]
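The expansion is an ordinary Cartesian product over the cross-opts lists, with the fixed parameters copied into every result. The following is a minimal sketch in plain Java; CrossOptsSketch and expand are hypothetical names for illustration, and the order in which the actual driver enumerates configurations is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not the ECD driver's implementation.
final class CrossOptsSketch {

  static List<Map<String, String>> expand(Map<String, String> fixed,
                                          Map<String, List<String>> crossOpts) {
    // Start with a single configuration holding only the fixed parameters.
    List<Map<String, String>> configs = new ArrayList<>();
    configs.add(new LinkedHashMap<>(fixed));

    // Each cross-opts entry multiplies the configurations built so far.
    for (Map.Entry<String, List<String>> option : crossOpts.entrySet()) {
      List<Map<String, String>> next = new ArrayList<>();
      for (String value : option.getValue()) {
        for (Map<String, String> partial : configs) {
          Map<String, String> extended = new LinkedHashMap<>(partial);
          extended.put(option.getKey(), value);
          next.add(extended);
        }
      }
      configs = next;
    }
    return configs;
  }

  public static void main(String[] args) {
    Map<String, String> fixed = new LinkedHashMap<>();
    fixed.put("foo", "bar");

    Map<String, List<String>> crossOpts = new LinkedHashMap<>();
    crossOpts.put("parameter-a", Arrays.asList("value100", "value200"));
    crossOpts.put("parameter-b", Arrays.asList("value300", "value400"));

    // Prints four configurations, one per line.
    for (Map<String, String> config : expand(fixed, crossOpts)) {
      System.out.println(config);
    }
  }
}
```

With n cross-opts lists of sizes k1 ... kn, the component is instantiated k1 x ... x kn times, so the number of configurations grows quickly as options are added.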

CSE pipeline descriptors

Configuration

Phases

Pipeline

Options

![In-phase pipeline][inphasepipeline]

![Execution path][executionpathinphasepipeline]

(Optional) Post-processing

Creating a type system

A type system defines the domain of a pipeline: what subject matter is being processed and what kinds of annotations are being produced.

Pipeline code

Writing collection reader code

Writing annotator code

Writing cas consumer code

Examples

These examples are based on the UIMA SDK tutorial. To see how they were migrated, see the section "Migrating your existing UIMA pipeline to CSE".

Example 1 (simple RoomNumberAnnotator)

For this example we will only use one type: RoomNumber. You can use the existing TutorialTypeSystem.xml under src/main/resources/types/.

<?xml version="1.0" encoding="UTF-8" ?>
  <typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
    <name>TutorialTypeSystem</name>
    <description>Type System Definition for the tutorial examples - 
        as of Exercise 1</description>
    <vendor>Apache Software Foundation</vendor>
    <version>1.0</version>
    <types>
      <typeDescription>
        <name>org.apache.uima.tutorial.RoomNumber</name>
        <description></description>
        <supertypeName>uima.tcas.Annotation</supertypeName>
        <features>
          <featureDescription>
            <name>building</name>
            <description>Building containing this room</description>
            <rangeTypeName>uima.cas.String</rangeTypeName>
          </featureDescription>
        </features>
      </typeDescription>
    </types>
  </typeSystemDescription>

Simply include the TutorialTypeSystem.xml file in your META-INF/org.uimafit/types.txt file. Your types.txt file should look like:

classpath*:types/TutorialTypeSystem.xml
classpath*:types/SourceDocumentInformation.xml

Next, we will write the main YAML descriptor for the example. Notice that we are using the same collection-reader and CAS consumer as before; we have only added the phase RoomNumberAnnotator. (src/main/resources/META-INF/oaqa-tutorial-ex1.yaml)

configuration: 
  name: oaqa-tutorial
  author: oaqa

collection-reader:
  inherit: collection_reader.filesystem-collection-reader
  InputDirectory: data/
pipeline:
  - inherit: ecd.phase  
    name: RoomNumberAnnotator
    options: |
      - inherit: tutorial.ex1.RoomNumberAnnotator 

  - inherit: cas_consumer.AnnotationPrinter

Now we define the descriptor for the RoomNumberAnnotator by just specifying where the class is located. (src/main/resources/tutorial.ex1/RoomNumberAnnotator.yaml)

class: org.apache.uima.tutorial.ex1.RoomNumberAnnotator

Finally, we write the code for the annotator, taken directly from the UIMA SDK tutorial (link). (src/main/java/org.apache.uima.tutorial.ex1/RoomNumberAnnotator.java)

package org.apache.uima.tutorial.ex1;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.tutorial.RoomNumber;

/**
 * Example annotator that detects room numbers using Java 1.4 regular expressions.
 */
public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  private Pattern mYorktownPattern = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
  private Pattern mHawthornePattern = Pattern.compile("\\b[JG1-4][1-2NS]-[A-Z]\\d\\d\\b");
  /**
   * @see JCasAnnotator_ImplBase#process(JCas)
   */
  public void process(JCas aJCas) {
    // get document text
    String docText = aJCas.getDocumentText();
    // search for Yorktown room numbers
    Matcher matcher = mYorktownPattern.matcher(docText);
    while (matcher.find()) {
      // found one - create annotation
      RoomNumber annotation = new RoomNumber(aJCas);
      annotation.setBegin(matcher.start());
      annotation.setEnd(matcher.end());
      annotation.setBuilding("Yorktown");
      annotation.addToIndexes();
    }
    // search for Hawthorne room numbers
    matcher = mHawthornePattern.matcher(docText);
    while (matcher.find()) {
      // found one - create annotation
      RoomNumber annotation = new RoomNumber(aJCas);
      annotation.setBegin(matcher.start());
      annotation.setEnd(matcher.end());
      annotation.setBuilding("Hawthorne");
      annotation.addToIndexes();
    }
  }
}
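To check what the two regular expressions actually match without starting a pipeline, they can be exercised with the JDK alone. The sketch below copies the two patterns from the annotator; the class name RoomPatternDemo and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone check of the annotator's patterns; no UIMA classes are needed.
final class RoomPatternDemo {
  static final Pattern YORKTOWN = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
  static final Pattern HAWTHORNE = Pattern.compile("\\b[JG1-4][1-2NS]-[A-Z]\\d\\d\\b");

  // Collects every substring of text that the pattern matches, in order.
  static List<String> findAll(Pattern pattern, String text) {
    List<String> hits = new ArrayList<>();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      hits.add(matcher.group());
    }
    return hits;
  }

  public static void main(String[] args) {
    String text = "Meet in 20-001 or GN-K35 today."; // made-up sample text
    System.out.println("Yorktown:  " + findAll(YORKTOWN, text));  // [20-001]
    System.out.println("Hawthorne: " + findAll(HAWTHORNE, text)); // [GN-K35]
  }
}
```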

To run, simply launch ex1.launch by right-clicking it and selecting Run As... -> ex1.

You should now be getting this output:

Example 2 (passing parameters to an annotator)

#oaqa-tutorial-ex2.yaml
configuration: 
  name: oaqa-tutorial
  author: oaqa

collection-reader:
  inherit: collection_reader.fs-collection-reader
  file: /data/UIMA_Seminars.txt
pipeline:
  - inherit: ecd.phase  
    name: RoomNumberAnnotator
    options: |
      - inherit: tutorial.ex2.RoomNumberAnnotator
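  - inherit: cas_consumer.XmiWriterCasConsumer

The annotator reads its Patterns and Locations parameters in initialize() and compiles one regular expression per pattern:

package org.apache.uima.tutorial.ex2;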
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.AnalysisComponent;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.tutorial.RoomNumber;
import org.apache.uima.util.Level;

public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  private Pattern[] mPatterns;
  private String[] mLocations;
 
  public void initialize(UimaContext aContext) throws ResourceInitializationException {
    super.initialize(aContext);
    // Get config. parameter values from oaqa-tutorial-ex2.yaml
    String[] patternStrings = (String[]) aContext.getConfigParameterValue("Patterns"); 
    mLocations = (String[]) aContext.getConfigParameterValue("Locations");

    // compile regular expressions
    mPatterns = new Pattern[patternStrings.length];
    for (int i = 0; i < patternStrings.length; i++) {
      mPatterns[i] = Pattern.compile(patternStrings[i]);
    }
  }

  /**
   * @see JCasAnnotator_ImplBase#process(JCas)
   */
  public void process(JCas aJCas) throws AnalysisEngineProcessException {
    // get document text
    String docText = aJCas.getDocumentText();
    // loop over patterns
    for (int i = 0; i < mPatterns.length; i++) {
      Matcher matcher = mPatterns[i].matcher(docText);
      while (matcher.find()) {
        // found one - create annotation
        RoomNumber annotation = new RoomNumber(aJCas);
        annotation.setBegin(matcher.start());
        annotation.setEnd(matcher.end());
        annotation.setBuilding(mLocations[i]);    
        annotation.addToIndexes();
        getContext().getLogger().log(Level.FINEST, "Found: " + annotation);
      }
    }
  }
}
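The annotator relies on Patterns and Locations being parallel arrays: the building for a match of Patterns[i] is Locations[i]. If Locations were shorter than Patterns, process() would fail with an ArrayIndexOutOfBoundsException on the first match of an unpaired pattern. The sketch below shows the pairing in plain Java, with a length check added as an illustration (the annotator itself does not perform one). The values are those of the RoomNumberAnnotator.xml descriptor shown in the UIMA-AS example further down; ParallelParamsDemo and locate are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: pairs each pattern with the location at the same index.
final class ParallelParamsDemo {

  static Map<String, String> locate(String[] patterns, String[] locations, String text) {
    // Added for illustration: fail early instead of inside the matching loop.
    if (patterns.length != locations.length) {
      throw new IllegalArgumentException("Patterns and Locations must have the same length");
    }
    Map<String, String> found = new LinkedHashMap<>();
    for (int i = 0; i < patterns.length; i++) {
      Matcher matcher = Pattern.compile(patterns[i]).matcher(text);
      while (matcher.find()) {
        found.put(matcher.group(), locations[i]);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    String[] patterns = {
      "\\b[0-4]\\d-[0-2]\\d\\d\\b",
      "\\b[G1-4][NS]-[A-Z]\\d\\d\\b",
      "\\bJ[12]-[A-Z]\\d\\d\\b"
    };
    String[] locations = {
      "Watson - Yorktown", "Watson - Hawthorne I", "Watson - Hawthorne II"
    };
    // Made-up sample text with one room per building.
    System.out.println(locate(patterns, locations, "Rooms 20-001, GN-K35 and J2-A11."));
  }
}
```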

To run, simply launch ex2.launch by right-clicking it and selecting Run As... -> ex2.

You should now be getting this output:

Example (UIMA-ASified RemoteRoomNumberAnnotator)

This example will show you how to run UIMA-AS annotators from within your CSE pipeline. The example will consist of two parts: (1) server-side setup, and (2) client-side setup. Since the server-side configuration is not currently supported by CSE, we will use the standard UIMA-AS SDK. You can read more about the purpose and usage of UIMA-AS here.

Server-side

  1. Download the UIMA AS Asynchronous Scaleout from here
  2. Decompress it into the directory where you want the remote service to run.
  3. cd into apache-uima-as-{version}
  4. Run export UIMA_HOME=/path/to/directory/apache-uima-as-{version}/
  5. Run ./bin/startBroker.sh

This will output:

.
.
.
INFO  BrokerService                  - ActiveMQ 5.4.1 JMS Message Broker (localhost) is starting
INFO  BrokerService                  - For help or more information please see: http://activemq.apache.org/
INFO  ManagementContext              - JMX consoles can connect to service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi
* INFO  TransportServerThreadSupport   - Listening for connections at: tcp://your-hostname:61616 *
.
.

Look for the fourth line, the one with TransportServerThreadSupport, and copy the tcp address, in our case tcp://your-hostname:61616; you will use it in later steps.

  6. In our example we will run the example RoomNumberAnnotator from the UIMA-AS tutorial, but these instructions apply to other annotators as well. As with any annotator in the previous examples, the UIMA-AS annotator is composed of an annotator descriptor and a Java class implementation. In addition to these, the UIMA-AS annotator requires a deployment descriptor.

Since CSE does not currently support server-side UIMA-AS annotators, the annotator descriptor will not be a CSE-style YAML file, but a "vanilla" UIMA XML descriptor provided in the UIMA-AS SDK. It will look like:

<!-- examples/descriptors/tutorial/ex2/RoomNumberAnnotator.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier" xmlns:xi="http://www.w3.org/2001/XInclude">
	<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
	<primitive>true</primitive>
	<annotatorImplementationName>org.apache.uima.tutorial.ex2.RoomNumberAnnotator</annotatorImplementationName>
	<analysisEngineMetaData>
		<name>Room Number Annotator</name>
		<description>An example annotator that searches for room numbers in the IBM Watson
			research buildings.</description>
		<version>1.0</version>
		<vendor>The Apache Software Foundation</vendor>
		<configurationParameters>
			<configurationParameter>
				<name>Patterns</name>
				<description>List of room number regular expression
					patterns.</description>
				<type>String</type>
				<multiValued>true</multiValued>
				<mandatory>true</mandatory>
			</configurationParameter>
			<configurationParameter>
				<name>Locations</name>
				<description>List of locations corresponding to the room number
					expressions specified by the Patterns parameter.</description>
				<type>String</type>
				<multiValued>true</multiValued>
				<mandatory>true</mandatory>
			</configurationParameter>
		</configurationParameters>
		<configurationParameterSettings>
			<nameValuePair>
				<name>Patterns</name>
				<value>
					<array>
						<string>\b[0-4]\d-[0-2]\d\d\b</string>
						<string>\b[G1-4][NS]-[A-Z]\d\d\b</string>
						<string>\bJ[12]-[A-Z]\d\d\b</string>
					</array>
				</value>
			</nameValuePair>
			<nameValuePair>
				<name>Locations</name>
				<value>
					<array>
						<string>Watson - Yorktown</string>
						<string>Watson - Hawthorne I</string>
						<string>Watson - Hawthorne II</string>
					</array>
				</value>
			</nameValuePair>
		</configurationParameterSettings>
		<typeSystemDescription>
			<imports>
				<import location="../ex1/TutorialTypeSystem.xml"/>
			</imports>
		</typeSystemDescription>
		<capabilities>
			<capability>
				<inputs></inputs>
				<outputs>
					<type>org.apache.uima.tutorial.RoomNumber</type>
					<feature>org.apache.uima.tutorial.RoomNumber:building</feature>
				</outputs>
				<languagesSupported></languagesSupported>
			</capability>
		</capabilities>
		<operationalProperties>
			<modifiesCas>true</modifiesCas>
			<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
			<outputsNewCASes>false</outputsNewCASes>
		</operationalProperties>
	</analysisEngineMetaData>
</analysisEngineDescription>

The code for the annotator will be exactly the same as the one from Example 2 and can also be found under ./examples/src/org/apache/uima/tutorial/ex2/RoomNumberAnnotator.java.

Finally, the deployment descriptor tells the broker from step 5 how to run the annotator:

<?xml version="1.0" encoding="UTF-8"?>
<!-- ./examples/deploy/as/Deploy_RoomNumberAnnotator.xml -->
<analysisEngineDeploymentDescription 
  xmlns="http://uima.apache.org/resourceSpecifier">
  
  <name>Room Number Annotator</name>
  <description>Deploys the Room Number Annotator Primitive AE</description>
  
  <deployment protocol="jms" provider="activemq">
    <service>
      <inputQueue endpoint="RoomNumberAnnotatorQueue" brokerURL="tcp://your-hostname:61616"/>
      <topDescriptor>
       <import location="../../descriptors/tutorial/ex2/RoomNumberAnnotator.xml"/> 
      </topDescriptor>
    </service>
  </deployment>
</analysisEngineDeploymentDescription>

Notes:

  • The endpoint attribute of the inputQueue tag is the endpoint that the client-side wrapper will use to connect to the annotator.
  • The location is the path to the annotator descriptor.

  7. Run ./bin/deployAsyncService.sh ./examples/deploy/as/Deploy_RoomNumberAnnotator.xml to launch the RoomNumberAnnotator as a remote UIMA-AS service. It should now display this output:
Service:Room Number Annotator Initialized. Ready To Process Messages From Queue:RoomNumberAnnotatorQueue
Press 'q'+'Enter' to quiesce and stop the service or 's'+'Enter' to stop it now.
Note: selected option is not echoed on the console.

The broker is now listening on the port for requests for the RoomNumberAnnotator from the client-side on the RoomNumberAnnotatorQueue endpoint defined in the deployment descriptor from step 6.

Client-side

For the client-side we can use the same pipeline descriptor from Example 2, only changing the annotator from RoomNumberAnnotator to RemoteRoomNumberAnnotator:

#./src/main/resources/oaqa-tutorial-ex2-remote.yaml
configuration: 
  name: oaqa-tutorial
  author: oaqa
collection-reader:
  inherit: collection_reader.filesystem-collection-reader
  InputDirectory: data/
  
pipeline:
  - inherit: ecd.phase  
    name: RemoteRoomNumberAnnotator
    options: |
      - inherit: tutorial.ex2.RemoteRoomNumberAnnotator 

  - inherit: cas_consumer.AnnotationPrinter

Now, we write the RemoteRoomNumberAnnotator.yaml as follows:

class: edu.cmu.lti.oaqa.ecd.phase.adapter.JMSAdapterWrapper
brokerUrl: tcp://your-hostname:61616
endpoint: RoomNumberAnnotatorQueue
timeout: 5000
getmetatimeout: 5000
cpctimeout: 5000

Notes:

  • The class is the path to the wrapper included in uima-ecd. It will be used for all client-side UIMA-AS descriptors.
  • The brokerUrl is the address you copied from step 5.
  • The endpoint is the endpoint defined in the server-side setup (step 6).
  • The timeout, getmetatimeout, and cpctimeout parameters are timeouts in milliseconds (will add more...).
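A mistyped brokerUrl typically only shows up as a timeout at run time, so it can be worth checking the value before launching the pipeline. The following is a small sketch using only java.net.URI; BrokerUrlCheck and portOf are hypothetical helper names and are not part of uima-ecd or UIMA-AS.

```java
import java.net.URI;

// Illustrative only: a sanity check for the brokerUrl value, not part of uima-ecd.
final class BrokerUrlCheck {

  // Returns the port of a tcp://host:port broker URL, or throws if the
  // scheme, host, or port is missing or malformed.
  static int portOf(String brokerUrl) {
    URI uri = URI.create(brokerUrl);
    if (!"tcp".equals(uri.getScheme()) || uri.getHost() == null || uri.getPort() < 0) {
      throw new IllegalArgumentException("Expected tcp://host:port but got: " + brokerUrl);
    }
    return uri.getPort();
  }

  public static void main(String[] args) {
    System.out.println(portOf("tcp://your-hostname:61616")); // prints 61616
  }
}
```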

Now you should be able to run the complete pipeline by running launching/ex2-remote.launch in Eclipse. The output should be identical to that in Example 2.

Migrating your existing UIMA pipeline to CSE

  • The type system has to be declared the uimaFit way, in META-INF/org.uimafit/types.txt (see slides).

  • CAS consumers have to use the uimaFit hierarchy of cas_consumerimpl in order to work in the pipeline:

    • initialize() becomes initialize(UimaContext context), that is, initialize takes a UimaContext argument.
    • processCas() becomes process(CAS aCas).

  • The collection reader has to extend AbstractCollectionReader:

    • Out-of-the-box UIMA collection readers cannot be used.
    • Implement the method getNextElement, which returns the next data element in the pipeline. This replaces getNext(jcas).
    • Override initialize and call the super implementation.

FAQ

Acknowledgements

Thanks to Zi Yang (@ziy), Elmer Garduno (@elmerg), and OAQA team members!

TODO

  1. Take UIMA GUI screenshots of the output of each example.
  2. How to create your own launch configuration using a yaml descriptor (ECD-Driver).

References

[1] (Session 12: Configuration Space Exploration (SE class)) - (SE)
[2] (Elmer's technical report) - (E)
[3] (Zi's Slides) - (Z)
[4] Zi HW0 - (hw0)
[5] Zi HW1 - (hw1)
[6] Zi HW2 - (hw2)
[7] UIMA SDK Tutorial (USDK)
[resources]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/resources.png "resources"
[inphasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/inphasepipeline.png "inphasepipeline"
[executionpathinphasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/executionpathinphasepipeline.png "executionpathinphasepipeline"
[threephasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/threephasepipeline.png "threephasepipeline"