Home
Oozie Maven Plugin (OMP) simplifies building packages (tar.gz files) that contain an Apache Oozie workflow definition together with all other files needed to run the workflow (configuration files, libraries, etc.). The plugin also makes workflows reusable: generated packages can be uploaded to a Maven repository, added as dependencies to other workflows, and reused as subworkflows. The plugin defines a new Maven artifact type called oozie, but it uses the standard build lifecycles.
The sources of OMP are available at https://github.com/CeON/oozie-maven-plugin.
The binaries are available in the ICM Maven repository. To use the plugin, add the following sections to your pom.xml:
<build>
  <plugins>
    <plugin>
      <groupId>pl.edu.icm.maven</groupId>
      <artifactId>oozie-maven-plugin</artifactId>
      <version>current_version_number</version>
      <extensions>true</extensions>
    </plugin>
  </plugins>
</build>
and
<pluginRepositories>
  <pluginRepository>
    <id>yadda</id>
    <name>YADDA project repository</name>
    <url>http://maven.icm.edu.pl/artifactory/repo</url>
  </pluginRepository>
</pluginRepositories>
A minimal project that uses OMP needs to contain the following files:
- pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>my-project-groupId</groupId>
  <artifactId>my-project-artifactId</artifactId>
  <version>VERSION_NUMBER</version>
  <packaging>oozie</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <build>
    <plugins>
      <plugin>
        <groupId>pl.edu.icm.maven</groupId>
        <artifactId>oozie-maven-plugin</artifactId>
        <version>1.1</version>
        <extensions>true</extensions>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>pl.edu.icm.oozie</groupId>
      <artifactId>oozie-runner</artifactId>
      <version>1.2-SNAPSHOT</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>yadda</id>
      <name>YADDA project repository</name>
      <url>http://maven.icm.edu.pl/artifactory/repo</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>yadda</id>
      <name>YADDA project repository</name>
      <url>http://maven.icm.edu.pl/artifactory/repo</url>
    </pluginRepository>
  </pluginRepositories>
</project>
- src/main/oozie/workflow.xml
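As an illustration of the second file, a workflow definition can be very small. The sketch below is a generic Apache Oozie workflow, not something taken from OMP itself; the workflow name is hypothetical and the schema version (0.4) is an assumption that should match your Oozie installation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- src/main/oozie/workflow.xml: a do-nothing workflow, for illustration only -->
<!-- "minimal-wf" is a hypothetical name; adjust the schema version to your Oozie server -->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="minimal-wf">
  <!-- execution starts here and goes straight to the end node -->
  <start to="end"/>
  <end name="end"/>
</workflow-app>
```

A real workflow would place its actions (and usually a kill node for error handling) between the start and end nodes.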
A project with these files can be generated from the OMP archetype:
mvn archetype:generate -DarchetypeArtifactId=oozie-maven-archetype \
  -DarchetypeGroupId=pl.edu.icm.maven.archetypes -DarchetypeVersion=1.0-SNAPSHOT \
  -DinteractiveMode=false -DgroupId=my-project-groupId -DartifactId=my-project-artifactId \
  -Dversion=VERSION_NUMBER -DarchetypeRepository=http://maven.icm.edu.pl/artifactory/repo
You can build the project by calling
mvn install
This call creates a package that can be uploaded to a Maven repository and used in other projects. The package does not contain any dependencies (subworkflows, libraries); these should be stored in the Maven repository as well. The resulting file is named artifactId-version-oozie-wf.tar.gz.
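To illustrate reuse, another workflow project might declare such a package as a dependency roughly as follows. This is only a sketch: the classifier oozie-wf and the type tar.gz are inferred from the file name pattern artifactId-version-oozie-wf.tar.gz under standard Maven naming, and the coordinates are hypothetical.

```xml
<dependencies>
  <dependency>
    <!-- hypothetical coordinates of a previously installed workflow package -->
    <groupId>my-project-groupId</groupId>
    <artifactId>my-subworkflow-artifactId</artifactId>
    <version>VERSION_NUMBER</version>
    <!-- assumption: classifier and type derived from the generated file name -->
    <classifier>oozie-wf</classifier>
    <type>tar.gz</type>
  </dependency>
</dependencies>
```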
To build a package intended to run on an Oozie server, call
mvn install -DjobPackage
The file created (artifactId-version-oozie-job.tar.gz) contains everything that is necessary to run a given workflow.
Oozie Maven Plugin (OMP) supports scripts written in PigLatin.
OMP lets you use Pig scripts from dependency modules. Such a module (containing Java classes such as UDFs used by the Pig scripts) should be added to your project as a direct dependency. The module's pom.xml must configure its resources so that the Pig scripts end up in the module's JAR. For example, the following snippet in pom.xml meets that requirement:
<build>
  <resources>
    <resource>
      <directory>src/main/pig</directory>
      <filtering>false</filtering>
      <includes>
        <include>**/*.pig</include>
      </includes>
      <excludes>
        <exclude>**/AUXIL*.pig</exclude>
      </excludes>
      <targetPath>${project.build.directory}/classes/pig</targetPath>
    </resource>
  </resources>
</build>
Once the above snippet is added to pom.xml, mvn install adds a pig directory to the generated JAR, containing the *.pig files copied from src/main/pig.
Example 1: file src/main/pig/lorem/ipsum/dolor/sit.pig will appear in the JAR file as pig/lorem/ipsum/dolor/sit.pig.
Example 2: file src/main/pig/lorem/ipsum/dolor/AUXIL_sit.pig will not be added to the JAR file, because it was excluded.
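For completeness, declaring such a module as a direct dependency of the workflow project could look like the sketch below. The coordinates are hypothetical; only the fact that the module is a regular JAR dependency comes from the description above.

```xml
<dependencies>
  <dependency>
    <!-- hypothetical module that packages Pig scripts (and UDF classes) in its JAR -->
    <groupId>my-project-groupId</groupId>
    <artifactId>my-pig-scripts-module</artifactId>
    <version>VERSION_NUMBER</version>
  </dependency>
</dependencies>
```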
The generation of workflows that use Pig scripts is described here. In short: Pig scripts are indicated in <script></script> tags, while imported scripts (i.e. macros) should be included in <file></file> tags. This is NOT the way OMP works.
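For comparison, a Pig action in a plain Oozie workflow (the approach that OMP does not use) typically looks like the sketch below. The action name, file names and transition targets are hypothetical; ${jobTracker} and ${nameNode} are assumed to be supplied as job properties.

```xml
<action name="run-pig-script">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- the main Pig script to execute -->
    <script>main_script.pig</script>
    <!-- an imported macro file, shipped alongside the script -->
    <file>macros/my_macros.pig#my_macros.pig</file>
  </pig>
  <!-- hypothetical node names for the success and failure transitions -->
  <ok to="end"/>
  <error to="fail"/>
</action>
```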
OMP supports two ways of working with Pig scripts:
- EASY
  - In this strategy, the POM file needs to contain the description of all used Pig scripts in the pig folder (see section "Modification of standard JAR file").
  - This strategy is very easy and is the highly recommended way of working with OMP.
  - OMP automatically manages macro files:
    - in the JAR file, macro files have to be in a different path than /pig, e.g. /pig/macros
    - it is recommended to put regular scripts in src/main/pig, while macros should go into src/main/pig/macros
    - you CANNOT use <file></file> tags to store macro files (it can cause errors).
- COMPLEX
  - When you have already implemented a module with a large number of scripts, adjusting the code to the EASY way of working with Pig scripts may be tedious. In that case, you can use the COMPLEX strategy, which has two disadvantages:
    - the pom.xml file will be longer
    - additional descriptor files are necessary.
  - Place executable scripts in pig-scripts and macros in pig-macros, e.g. pom.xml
  - Place information about the location of a descriptor in the workflow's pom.xml (only the first descriptor will be used), e.g. pom.xml
  - Add the descriptor, e.g. descriptor.xml, conforming to the following XSD schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema version="1.0"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified"
>
  <xs:complexType name="includesType">
    <xs:sequence>
      <xs:element name="include" minOccurs="0" maxOccurs="unbounded" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="excludesType">
    <xs:sequence>
      <xs:element name="exclude" minOccurs="0" maxOccurs="unbounded" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="scriptHandlingType">
    <xs:sequence>
      <xs:element name="root" maxOccurs="1" type="xs:string" default="pig" />
      <xs:element name="preserve" maxOccurs="1" type="xs:boolean" default="true"/>
      <xs:element name="target" maxOccurs="1" type="xs:string" default="/"/>
      <xs:element name="includes" minOccurs="1" maxOccurs="1" type="includesType"/>
      <xs:element name="excludes" minOccurs="0" maxOccurs="1" type="excludesType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="mainProjectPigType">
    <xs:sequence>
      <xs:element name="scripts" type="scriptHandlingType"/>
      <xs:element name="macros" type="scriptHandlingType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="depsProjectPigType">
    <xs:sequence>
      <xs:element name="all-scripts" type="scriptHandlingType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="oozieMavenPluginType">
    <xs:sequence>
      <xs:element name="main-project-pig" type="mainProjectPigType"/>
      <xs:element name="deps-project-pig" type="depsProjectPigType"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="oozie-maven-plugin" type="oozieMavenPluginType"/>
</xs:schema>
In a descriptor file, in the <build><resources> tag, you need to specify whether a given script is a main script or a macro. Rules for including/excluding the main project's scripts and macros are given in the <main-project-pig><scripts> and <main-project-pig><macros> tags.
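A descriptor that is structurally valid against the schema above might look like the following sketch. Only the element names and their order come from the schema; the directory names, the Ant-style include/exclude patterns and the exact meaning of root, preserve and target are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- descriptor.xml: illustrative values only -->
<oozie-maven-plugin>
  <main-project-pig>
    <!-- rules for executable scripts of the main project -->
    <scripts>
      <root>pig-scripts</root>        <!-- assumed source folder -->
      <preserve>true</preserve>       <!-- schema default -->
      <target>/</target>              <!-- schema default -->
      <includes>
        <include>**/*.pig</include>
      </includes>
      <excludes>
        <exclude>**/AUXIL*.pig</exclude>
      </excludes>
    </scripts>
    <!-- rules for macro files of the main project -->
    <macros>
      <root>pig-macros</root>         <!-- assumed source folder -->
      <preserve>true</preserve>
      <target>/</target>
      <includes>
        <include>**/*.pig</include>
      </includes>
    </macros>
  </main-project-pig>
  <deps-project-pig>
    <!-- rules applied to scripts coming from dependency modules -->
    <all-scripts>
      <root>pig</root>                <!-- schema default -->
      <preserve>true</preserve>
      <target>/</target>
      <includes>
        <include>**/*.pig</include>
      </includes>
    </all-scripts>
  </deps-project-pig>
</oozie-maven-plugin>
```

Note that the schema declares root, preserve, target and includes without minOccurs="0", so each of them has to be present in every block; only excludes is optional.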
Oozie Maven Plugin defines the following integration test phases: pre-integration-test, integration-test and post-integration-test. Each integration test sends the following data to an Oozie server: workflow's definition, required libraries and test input data.
Configuration files for integration tests are in src/test/resources/configIT directory. The configuration is divided into two parts described below.
Environment configuration is stored in src/test/resources/configIT/env/IT-env-<profile_name>.properties files, where <profile_name> denotes the name of a profile. You can create several profiles and choose the current one with the -DIT.env=profile_name option. The default profile name is local, so unless you indicate otherwise, OMP will use the src/test/resources/configIT/env/IT-env-local.properties file.
The properties file should contain the following variables:
- oozieServiceURI -- the address of Oozie server, e.g. http://localhost:11000/oozie/
- nameNode -- the address of HDFS' namenode, e.g. localhost:8020 or hdfs://localhost:8020
- jobTracker -- the address of Job Tracker, e.g. localhost:8021
- queueName -- the name of a queue, usually "default"
- hdfsUserName -- the name of a user in HDFS
- hdfsWorkingDirURI -- the address of a working directory in HDFS, where the workflow, input data and output data are stored. During tests, everything should happen in that directory. The directory should not exist before a test is executed; it is created at the beginning of a test and removed when the test is finished. The hdfsWorkingDirURI variable should have the form hdfs://server:port/directory/ (or webhdfs://...). Watch out for double "/" characters, which can cause problems when they appear right after server:port.
- wfDir -- a directory in which a workflow's definition will be placed, it will be created as a subdirectory of hdfsWorkingDirURI