Skip to content

jaquejbrito/star-join-spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spark Star Join

You will find in this repository the implementation of two efficient solutions for Star Joins using Spark framework, dropping the computation time by at least 60% when compared to other solutions available. Namely, they are the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). Each of these strategies excel in different scenarios (for details, click here): SBJ is twice faster when the memory available to each executor is large enough; and SBFCJ is remarkably resilient to low memory scenarios. As of now, these algorithms are very competitive, and may be easily combined with other technologies for further improvement (such as new data types or file managements).

You will also find a direct Spark implementation of a sequence of joins, which delivers very poor performance and is far from being eligible as a good solution. This shows the importance of additional filtering.

These strategies were presented and studied in our recent paper, Brito et al, "Faster Cloud Star Joins with Reduced Disk Spill and Network Communication". If you find it useful for your own research/applications, please cite our work and/or star our repository. If you need more information or have suggestions, feel free to either contact me or make a PR. Feedback is very important.

Relevant papers

If you find this useful, please star this repository and/or cite our paper:

Jaqueline Brito, Thiago Mosqueiro, Ricardo R Ciferri and Cristina DA Ciferri. Faster Cloud Star Joins with Reduced Disk Spill and Network Communication. Chemometrics and Intelligent Laboratory Systems 2016.

Acknowledgements

We acknowledge Microsoft Azure Research grant MS-AZR-0036P, FAPESP grant 2012/13158-9 and CNPq grant 234817/2014-3.

License

Feel free to use this code for studying, applying to your own problems, or anything that complies with the MIT License (MIT), available in the folder License. If you use this code, we kindly ask that you cite our paper Brito et al, where both of these strategies were presented and studied.

About

Two efficient algorithms to process Star Joins using the Spark framework: Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages