-
Functions to allow Pig scripts to index data to Solr or Fusion.
-
Supports Kerberos for authentication.
-
As of v2.2.4, integration with Lucidworks Fusion is supported, allowing use of Fusion pipelines for further data transformation prior to indexing.
Supported versions are:
-
Solr 5.x and higher
-
Pig versions 0.12 through 0.16
-
Hadoop 3.x
-
Fusion 2.4.x and higher
Warning
|
You must use Java 8 or higher to build the .jar. |
This project has a dependency on the solr-hadoop-common
submodule (contained in a separate GitHub repository, https://github.com/lucidworks/solr-hadoop-common). This submodule must be initialized before building the Functions jar.
To initialize the submodule, pull this repo, then:
git submodule init
Once the submodule has been initialized, the command git submodule update
will fetch all the data from that project and check out the appropriate commit listed in the superproject. You must initialize and update the submodule before attempting to build the Functions jar.
-
if a build is happening from a branch, please make sure that
solr-hadoop-common
is pointing to the correct SHA. (see https://github.com/blog/2104-working-with-submodules for more details)
pig-solr $ git checkout <branch-name> pig-solr $ cd solr-hadoop-common pig-solr/solr-hadoop-common $ git checkout <SHA> pig-solr/solr-hadoop-common $ cd ..
The build uses Gradle. However, you do not need Gradle installed before attempting to build.
To build the .jar file, run this command:
./gradlew clean shadowJar --info
This will make a .jar file:
solr-pig-functions/build/libs/solr-pig-functions-4.0.0.jar
The .jar is required to use the Pig functions.
If GitHub + SSH is not configured the following exception will be thrown:
Cloning into 'solr-hadoop-common'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:LucidWorks/solr-hadoop-common.git' into submodule path 'solr-hadoop-common' failed
To fix this error, use the generating an SSH key tutorial.
The Pig functions included in the solr-pig-functions-4.0.0.jar
are three UserDefined Functions (UDF) and two Store functions. These functions are:
-
com/lucidworks/hadoop/pig/SolrStoreFunc.class
-
com/lucidworks/hadoop/pig/FusionIndexPipelinesStoreFunc.class
-
com/lucidworks/hadoop/pig/EpochToCalendar.class
-
com/lucidworks/hadoop/pig/Extract.class
-
com/lucidworks/hadoop/pig/Histogram.class
There are two approaches to using functions in Pig: REGISTER
them in the script, or load them with your Pig command line request.
If using REGISTER
, the Pig function jars must be put in HDFS in order to be used by your Pig script. It can be located anywhere in HDFS; you can either supply the path in your script or use a variable and define the variable with -p
property definition.
The example below uses the second approach, loading the jars with the -Dpig.additional.jars
system property when launching the script. With this approach, the jars can be located anywhere on the machine where the script will be run.
There are a few required parameters for your script to output data to Solr for indexing.
These parameters can be defined in the script itself, or turned into variables that are defined each time the script runs. The example Pig script below shows an example of using these parameters with variables.
solr.zkhost
-
The ZooKeeper connection string if using Solr in SolrCloud mode. This should be in the form of
server:port,server:port,server:port/chroot
.
If you are not using SolrCloud, use the solr.server.url
parameter instead.
solr.server.url
-
The location of the Solr instance when Solr is running in standalone mode. This should be in the form of
http://server:port/solr
. solr.collection
-
The name of the Solr collection where documents will be indexed.
When a Solr cluster is secured with Kerberos for internode communication, Pig scripts must include the full path to a JAAS file that includes the service principal and the path to a keytab file that will be used to index the output of the script to Solr.
Two parameters provide the information the script needs to access the JAAS file:
lww.jaas.file
-
The path to the JAAS file that includes a section for the service principal who will write to the Solr indexes. For example, to use this property in a Pig script:
set lww.jaas.file '/path/to/login.conf';
The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed).
lww.jaas.appname
-
The name of the section in the JAAS file that includes the correct service principal and keytab path. For example, to use this property in a Pig script:
set lww.jaas.appname 'Client';
Here is a sample section of a JAAS file:
Client { --(1)
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/data/solr-indexer.keytab" --(2)
storeKey=true
useTicketCache=true
debug=true
principal="solr-indexer@SOLRSERVER.COM"; --(3)
};
-
The name of this section of the JAAS file. This name will be used with the
lww.jaas.appname
parameter. -
The location of the keytab file.
-
The service principal name. This should be a different principal than the one used for Solr, but must have access to both Solr and Pig.
When SSL is enabled in a Solr cluster, Pig scripts must include the full paths to the keystore
and truststore
with their respective passwords.
set lww.keystore '/path/to/solr-ssl.keystore.jks'
set lww.keystore.password 'secret'
set lww.truststore '/path/to/solr-ssl.truststore.jks'
set lww.truststore.password 'secret'
Tip
|
The paths (and secret configurations) should be the same in all YARN/MapReduce hosts. |
When indexing data to Fusion, there are several parameters to pass with your script in order to output data to Fusion for indexing.
These parameters can be made into variables in the script, with the proper values passed on the command line when the script is initiated. The example script below shows how to do this for Solr. The theory is the same for Fusion, only the parameter names would change as appropriate:
fusion.endpoints
-
The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection data will be indexed to.
fusion.fail.on.error
-
If
true
, when an error is encountered, such as if a row could not be parsed, indexing will stop. This isfalse
by default. fusion.buffer.timeoutms
-
The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents will be sent to Fusion when either this value or
fusion.batchSize
is met. fusion.batchSize
-
The number of documents to batch before sending the batch to Fusion. The default is 500. Documents will be sent to Fusion when either this value or
fusion.buffer.timeoutms
is met. fusion.realm
-
This is used with
fusion.user
andfusion.password
to authenticate to Fusion for indexing data. Two options are supported,KERBEROS
orNATIVE
.Kerberos authentication is supported with the additional definition of a JAAS file. The properties
java.security.auth.login.config
andfusion.jaas.appname
are used to define the location of the JAAS file and the section of the file to use. These are described in more detail below.Native authentication uses a Fusion-defined username and password. This user must exist in Fusion, and have the proper permissions to index documents.
fusion.user
-
The Fusion username or Kerberos principal to use for authentication to Fusion.
If a Fusion username is used (
'fusion.realm' = 'NATIVE'
), thefusion.password
must also be supplied. fusion.pass
-
This property is not shown in the example above. The password for the
fusion.user
when thefusion.realm
isNATIVE
.
When Fusion is secured with Kerberos, Pig scripts must include the full path to a JAAS file that includes the service principal and the path to a keytab file that will be used to index the output of the script to Fusion.
Additionally, a Kerberos ticket must be obtained on the server for the principal using kinit
.
java.security.auth.login.config
-
This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to write to Fusion.
The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:
Client { --(1) com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab="/data/fusion-indexer.keytab" --(2) storeKey=true useTicketCache=true debug=true principal="fusion-indexer@FUSIONSERVER.COM"; --(3) };
-
The name of this section of the JAAS file. This name will be used with the
fusion.jaas.appname
parameter. -
The location of the keytab file.
-
The service principal name. This should be a different principal than the one used for Fusion, but must have access to both Fusion and Pig. This name is used with the
fusion.user
parameter described above.
-
fusion.jaas.appname
-
Used only when indexing to or reading from Fusion when it is secured with Kerberos.
This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
The following Pig script will take a simple CSV file and index it to Solr.
set solr.zkhost '$zkHost';
set solr.collection '$collection'; -- (1)
A = load '$csv' using PigStorage(',') as (id_s:chararray,city_s:chararray,country_s:chararray,code_s:chararray,code2_s:chararray,latitude_s:chararray,longitude_s:chararray,flag_s:chararray); -- (2)
--dump A;
B = FOREACH A GENERATE $0 as id, 'city_s', $1, 'country_s', $2, 'code_s', $3, 'code2_s', $4, 'latitude_s', $5, 'longitude_s', $6, 'flag_s', $7; -- (3)
ok = store B into 'SOLR' using com.lucidworks.hadoop.pig.SolrStoreFunc(); -- (4)
This relatively simple script is doing several things that help to understand how the Solr Pig functions work.
-
This and the line above define parameters that are needed by
SolrStoreFunc
to know where Solr is.SolrStoreFunc
needs the propertiessolr.zkhost
andsolr.collection
, and these lines are mapping thezkhost
andcollection
parameters we will pass when invoking Pig to the required properties. -
Load the CSV file, the path and name we will pass with the
csv
parameter. We also define the field names for each column in CSV file, and their types. -
For each item in the CSV file, generate a document id from the first field (
$0
) and then define each field name and value inname, value
pairs. -
Load the documents into Solr, using the
SolrStoreFunc
. While we don’t need to define the location of Solr here, the function will use thezkhost
andcollection
properties that we will pass when we invoke our Pig script.
Warning
|
When using SolrStoreFunc , the document ID must be the first field.
|
When we want to run this script, we invoke Pig and define several parameters we have referenced in the script with the -p
option, such as in this command:
./bin/pig -Dpig.additional.jars=/path/to/solr-pig-functions-4.0.0.jar -p csv=/path/to/my/csv/airports.dat -p zkHost=zknode1:2181,zknode2:2181,zknode3:2181/solr -p collection=myCollection ~/myScripts/index-csv.pig
The parameters to pass are:
csv
-
The path and name of the CSV file we want to process.
zkhost
-
The ZooKeeper connection string for a SolrCloud cluster, in the form of
zkhost1:port,zkhost2:port,zkhost3:port/chroot
. In the script, we mapped this to thesolr.zkhost
property, which is required by theSolrStoreFunc
to know where to send the output documents. collection
-
The Solr collection to index into. In the script, we mapped this to the
solr.collection
property, which is required by theSolrStoreFunc
to know the Solr collection the documents should be indexed to.
Tip
|
The If, however, you are not using SolrCloud, you can use the In the script, you would change the line that maps `set solr.server.url '$solrUrl';` |
-
Fork this repo i.e. <username|organization>/hadoop-solr, following the fork a repo tutorial. Then, clone the forked repo on your local machine:
$ git clone https://github.com/<username|organization>/hadoop-solr.git
-
Configure remotes with the configuring remotes tutorial.
-
Create a new branch:
$ git checkout -b new_branch $ git push origin new_branch
Use the creating branches tutorial to create the branch from GitHub UI if you prefer.
-
Develop on
new_branch
branch only, do not mergenew_branch
to your master. Commit changes tonew_branch
as often as you like:$ git add <filename> $ git commit -m 'commit message'
-
Push your changes to GitHub.
$ git push origin new_branch
-
Repeat the commit & push steps until your development is complete.
-
Before submitting a pull request, fetch upstream changes that were done by other contributors:
$ git fetch upstream
-
And update master locally:
$ git checkout master $ git pull upstream master
-
Merge master branch into
new_branch
in order to avoid conflicts:$ git checkout new_branch $ git merge master
-
If conflicts happen, use the resolving merge conflicts tutorial to fix them:
-
Push master changes to
new_branch
branch$ git push origin new_branch
-
Add jUnits, as appropriate, to test your changes.
-
When all testing is done, use the create a pull request tutorial to submit your change to the repo.
Note
|
Please be sure that your pull request sends only your changes, and no others. Check it using the command:
|