Skip to content

BrentDorsey/beamexample

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 

Repository files navigation

Apache Beam Example Code

An example Apache Beam project.

Description

This example can be used with conference talks and self-study. The base of the examples are taken from Beam's example directory. They are modified to use Beam as a dependency in the pom.xml instead of being compiled together. The example code is changed to output to local directories.

How to clone and run

  1. Open a terminal window.
  2. Run git clone git@github.com:eljefe6a/beamexample.git
  3. Run cd beamexample/BeamTutorial
  4. Run mvn compile
  5. Create local output directory: mkdir output
  6. Run mvn exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1"
  7. Run cat output/user_score to verify the program ran correctly and the output file was created.

Using a Java IDE

  1. Follow the IDE Setup instructions on the Apache Beam Contribution Guide.

Other Runners

Apache Flink

  1. Follow the first steps from Flink's Quickstart to download Flink.
  2. Create the output directory.
  3. To run on a JVM-local cluster: mvn exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=[local]'
  4. To run on an out-of-process local cluster (note that the steps below should also work on a real cluster if you have one running):
    1. Start a local Flink cluster.
    2. Navigate to the WebUI (typically http://localhost:8081), click JobManager, and note the value of jobmanager.rpc.port. The default is probably 6123.
    3. Run mvn package to generate a JAR file. Note the location of the generated JAR (probably target/Tutorial-0.0.1-SNAPSHOT.jar)
    4. Run mvn -X -e exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=localhost:6123 --filesToStage=target/Tutorial-0.0.1-SNAPSHOT.jar', replacing the defaults for port and JAR file if they differ.
    5. Check in the WebUI to see the job listed.
  5. Run cat output/user_score to verify the pipeline ran correctly and the output file was created.

Apache Spark

  1. Create the output directory.
  2. Allow all users (Spark may run as a different user) to write to the output directory. chmod 1777 output.
  3. Change the output file to a fully-qualified path. For example, this("output/user_score"); to this("/home/vmuser/output/user_score");
  4. Run mvn package
  5. Run spark-submit --jars ~/.m2/repository/org/apache/beam/beam-runners-spark/0.3.0-incubating-SNAPSHOT/beam-runners-spark-0.3.0-incubating-SNAPSHOT.jar --class org.apache.beam.examples.tutorial.game.solution.Exercise2 --master yarn-client target/Tutorial-0.0.1-SNAPSHOT.jar --runner=SparkRunner

Google Cloud Dataflow

  1. Follow the steps in either of the Java quickstarts for Cloud Dataflow to initialize your Google Cloud setup.
  2. Create a bucket on Google Cloud Storage for staging and output.
  3. Run mvn -X exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Dexec.args='--runner=DataflowRunner --project=<YOUR-GOOGLE-CLOUD-PROJECT> --gcpTempLocation=gs://<YOUR-BUCKET-NAME> --outputPrefix=gs://<YOUR-BUCKET-NAME>/output/', after replacing <YOUR-GCP-PROJECT> and <YOUR-BUCKET-NAME> with the appropriate values.
  4. Check the Cloud Dataflow Console to see the job running.
  5. Check the output bucket to see the generated output: https://console.cloud.google.com/storage/browser/<YOUR-BUCKET-NAME>/

Further Reading

About

An example Apache Beam project.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 100.0%