MatdeB-SL/Spark-Performance---Cycle-Hire-Data

This project splits the London Cycle Hire data into two subsets, weekend journeys and weekday journeys, and saves each as Parquet. The split is implemented twice, once in a slow fashion and once in a quick one, as discussed in this blog post. A full run produces ~2GB of output data.
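The weekend/weekday rule behind the split can be sketched in plain Java, independent of Spark. This is only an illustration and not the code in BikeDataAnalysis.java: the class name JourneySplitSketch, the timestamp pattern dd/MM/yyyy HH:mm, and the sample values are all assumptions, so check them against the real data files before reusing anything.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: classifies journey start timestamps as weekend or weekday.
final class JourneySplitSketch {

    // Assumed timestamp layout (e.g. "10/01/2015 14:30"); adjust to match the real files.
    static final DateTimeFormatter START_DATE_FORMAT =
            DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm");

    // A journey counts as a weekend journey if it starts on a Saturday or Sunday.
    static boolean isWeekend(String startDate) {
        DayOfWeek day = LocalDateTime.parse(startDate, START_DATE_FORMAT).getDayOfWeek();
        return day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY;
    }

    // Key true holds weekend journeys, key false holds weekday journeys.
    static Map<Boolean, List<String>> split(List<String> startDates) {
        return startDates.stream()
                .collect(Collectors.partitioningBy(JourneySplitSketch::isWeekend));
    }

    public static void main(String[] args) {
        // Made-up sample timestamps: a Saturday, a Monday and a Sunday.
        List<String> sample = Arrays.asList(
                "10/01/2015 14:30", "12/01/2015 08:05", "11/01/2015 23:59");
        Map<Boolean, List<String>> parts = split(sample);
        System.out.println("weekend: " + parts.get(true));
        System.out.println("weekday: " + parts.get(false));
    }
}
```

In the Spark job the same predicate would be applied per record before each subset is written out as Parquet. How that is done efficiently is the subject of the blog post.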

For the job to work correctly, you will need to download all the data files and put them in a folder called data/full-bike.
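Before submitting the job it can help to confirm that the data folder is in place. The following is a minimal sketch of such a pre-flight check. The class DataFolderCheck and its method are hypothetical and not part of this project, and the check only counts regular files; it does not validate their contents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical pre-flight check: lists the files found in the data folder.
final class DataFolderCheck {

    // Returns the regular files directly inside dir, sorted by path.
    static List<Path> listDataFiles(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            throw new IOException("Missing data folder: " + dir.toAbsolutePath());
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Folder name taken from the instructions above, relative to the working directory.
        List<Path> files = listDataFiles(Paths.get("data", "full-bike"));
        if (files.isEmpty()) {
            System.err.println("data/full-bike exists but contains no files");
            System.exit(1);
        }
        System.out.println("Found " + files.size() + " data file(s)");
    }
}
```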

The project is configured to be built as a jar and passed to spark-submit on a YARN cluster, with a command such as the following:

spark-submit \
  --class com.scottlogic.blog.analysis.BikeDataAnalysis \
  --master yarn \
  --deploy-mode client \
  --executor-memory 4g \
  --num-executors 2 \
  --conf spark.executor.instances=2 \
  --total-executor-cores 6 \
  spark-shuffle-performance-1.0-SNAPSHOT.jar 10000

Alternatively, it may be run locally by uncommenting line 18 in BikeDataAnalysis.java and commenting out line 19.
