chollinger93/beam-examples

Apache Beam Examples

Examples for Apache Beam / Dataflow in Python and Go.

Based on my blog at chollinger.com/blog/.

Please see:

  1. A Data Engineering Perspective on Go vs. Python (Part 1)
  2. A Data Engineering Perspective on Go vs. Python (Part 2)

Use Case

Shows the differences between Python and Go for Apache Beam by implementing the same use case in both: parsing IMDb movie data to find movies that match a set of preferences.

[Figure: architecture diagram]

Implementations

Both the Go and the Python code implement three ways of doing this, in increasing order of performance:

  1. Using lists as Side Input and comparing for each element
  2. Using dicts/maps as Side Input and looking up a match for each element (does not work on Dataflow with Go)
  3. Using CoGroupByKey
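
To make the difference between the three approaches concrete, here is a minimal sketch in plain Python, with no Beam dependency. It is not the code from this repository: the sample rows, the `MIN_RATING` threshold, and the function names are hypothetical and exist only for illustration. In the Beam Python SDK the three strategies roughly correspond to `beam.pvalue.AsList`, `beam.pvalue.AsDict`, and `beam.CoGroupByKey`.

```python
from collections import defaultdict

# Hypothetical sample rows; the first field stands in for the IMDb title ID
# that both input files share.
basics = [("tt001", "Movie A"), ("tt002", "Movie B"), ("tt003", "Movie C")]
ratings = [("tt001", 8.1), ("tt003", 5.4)]
MIN_RATING = 7.0  # hypothetical preference threshold


def join_with_list(basics, ratings):
    """Strategy 1: scan the whole ratings list for every title, O(n * m)."""
    matches = []
    for tconst, title in basics:
        for rated_id, rating in ratings:
            if rated_id == tconst and rating >= MIN_RATING:
                matches.append((title, rating))
    return matches


def join_with_dict(basics, ratings):
    """Strategy 2: build a dict once, then do one hash lookup per title."""
    by_id = dict(ratings)
    matches = []
    for tconst, title in basics:
        rating = by_id.get(tconst)
        if rating is not None and rating >= MIN_RATING:
            matches.append((title, rating))
    return matches


def join_with_cogroup(basics, ratings):
    """Strategy 3: group both inputs by key, then emit matches per key."""
    grouped = defaultdict(lambda: {"basics": [], "ratings": []})
    for tconst, title in basics:
        grouped[tconst]["basics"].append(title)
    for tconst, rating in ratings:
        grouped[tconst]["ratings"].append(rating)
    matches = []
    for group in grouped.values():
        for title in group["basics"]:
            for rating in group["ratings"]:
                if rating >= MIN_RATING:
                    matches.append((title, rating))
    return matches
```

All three functions return the same matches. They differ in how much work is done per element and, in a real pipeline, in whether the lookup data has to fit into each worker's memory (side inputs) or is shuffled and grouped by the runner (CoGroupByKey).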

Run Python

cd python/
pip3 install apache-beam==2.22.0 --upgrade
SUFFIX= # set to cogroup, side_list, side_map
python3 movie_pipeline_$SUFFIX.py --input-basics ../data/title.basics.100.tsv --output ./test.txt --input-ratings ../data/title.ratings.100.tsv

Run Go

cd go/
go get -u github.com/apache/beam/sdks/go/...
SUFFIX= # set to cogroup, side_list, side_map
go run movie_pipeline_$SUFFIX.go --input-basics ../data/title.basics.100.tsv --input-ratings ../data/title.ratings.100.tsv --output ./test.txt

Performance

[Figure: Side Input DirectRunner Performance]

[Figure: CoGroupByKey DirectRunner Performance]

License

This project is licensed under the GNU GPLv3 License - see the LICENSE file for details.
