Apache Spark

To run these examples, download and install Apache Spark version 1.6.1.


The Apache Spark example is an SBT subproject. To run it, complete the following steps:

  1. build a fat-jar of the example with sbt ";++2.10.6;exampleSpark/assembly"
  2. run it with Spark by executing $SPARK_HOME/bin/spark-submit ./examples/spark/target/scala-2.10/gnparser-example-spark-assembly-1.0.0.jar


The parser can also be called from PySpark. To do so, complete the following steps:

  1. build a fat-jar of gnparser's spark-python project with sbt ";++2.10.6;sparkPython/assembly". The project provides a thin wrapper that transforms the input (an RDD[String] of scientific names) into the output (an RDD[String] of parsed results in compact JSON format).
  2. run pyspark with the following command:
$SPARK_HOME/bin/pyspark \
  --jars "`pwd`/spark-python/target/scala-2.10/gnparser-spark-python-assembly-1.0.0.jar" \
  3. add a Python snippet to call the wrapper:
def parse(names):
    from pyspark.mllib.common import _py2java, _java2py
    parser =
    result = parser.parse(_py2java(sc, names), False, False)
    return _java2py(sc, result)
  4. now scientific name strings can be parsed in your program as follows:
names = sc.parallelize(["Homo sapiens Linnaeus 1758",
                        "Salinator solida (Martens, 1878)",
                        "Taraxacum officinale F. H. Wigg."])

import json
canonical_names = parse(names) \
  .map(lambda r: json.loads(r)) \
  .map(lambda j: (j["name_string_id"], j["canonical_name"]["value"])) \
  .collect()

print(canonical_names)

# [(u'208eb0ea-40e3-5894-9b7d-664721bd24e6', u'Homo sapiens'),
#  (u'b0f8459f-8b73-514c-b6f3-568d54d99ded', u'Salinator solida'),
#  (u'c2ab9908-ea25-57e1-835a-06b9d1ade53b', u'Taraxacum officinale')]
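
The JSON handling in the last step can be tried without a Spark installation. The sketch below is only an illustration under stated assumptions: `extract_canonical` is a hypothetical helper, not part of gnparser, and the sample records are hand-written and contain only the two fields used above (`name_string_id` and `canonical_name.value`). Real parser output is expected to carry more fields.

```python
import json


def extract_canonical(record):
    """Return (name_string_id, canonical name) from one compact JSON result string.

    Hypothetical helper: it assumes both fields are present in the record.
    """
    parsed = json.loads(record)
    return (parsed["name_string_id"], parsed["canonical_name"]["value"])


# Hand-written, trimmed-down stand-ins for parser output (assumption:
# only the two fields read above are included).
sample_results = [
    json.dumps({"name_string_id": "208eb0ea-40e3-5894-9b7d-664721bd24e6",
                "canonical_name": {"value": "Homo sapiens"}}),
    json.dumps({"name_string_id": "b0f8459f-8b73-514c-b6f3-568d54d99ded",
                "canonical_name": {"value": "Salinator solida"}}),
]

# Same transformation as the two chained map calls, applied to a plain list.
canonical_names = [extract_canonical(r) for r in sample_results]
```

In a pyspark session the same function could replace the two lambdas, for example `parse(names).map(extract_canonical)`.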