SparkSlidingAggregation

Implementation of the minimal MapReduce sliding aggregation algorithm in PySpark.

Authors of the algorithm: Yufei Tao, Wenqing Lin, Xiaokui Xiao

Links to the paper describing the algorithm:

https://dl.acm.org/doi/10.1145/2463676.2463719

https://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf

The data set is the Yellow Taxi Trip Records (CSV) for January 2021 from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. For each record, I computed the average ride distance and the average passenger occupancy over the last 1000 rides. The algorithm is minimal, in the sense defined in the paper, and follows the one described there. It uses the Spark RDD API in Python.
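To illustrate the idea of a partition-based sliding average, here is a minimal pure-Python sketch. It is not the code from this repository and does not use Spark: the partitions are simulated with plain lists, and the names `sliding_averages`, `window`, and `num_parts` are hypothetical. It assumes the records are already sorted and that the window covers the current record plus the records before it.

```python
from itertools import accumulate


def sliding_averages(values, window, num_parts):
    """Average of the last `window` values (current one included) for each position.

    Illustrative sketch only: partitions are simulated with lists.
    """
    n = len(values)
    # Ceil division: number of records per simulated partition (at least 1).
    size = max(1, -(-n // num_parts))
    parts = [values[k:k + size] for k in range(0, n, size)]

    # Step 1: each partition computes its local prefix sums and its total.
    prefixes = [[0] + list(accumulate(p)) for p in parts]
    totals = [pre[-1] for pre in prefixes]

    # Step 2: each partition combines its own prefix sums with
    #  - the suffix sum of the partition where the window starts, and
    #  - the totals of every partition lying fully inside the window.
    # In a distributed setting these pieces would have to be shipped between
    # partitions; here they are simply read from the shared lists.
    result = []
    for j, part in enumerate(parts):
        for offset in range(len(part)):
            i = j * size + offset              # global position of the record
            start = max(0, i - window + 1)     # global position of window start
            sj, soff = divmod(start, size)     # partition and offset of window start
            if sj == j:
                total = prefixes[j][offset + 1] - prefixes[j][soff]
            else:
                total = (
                    prefixes[sj][-1] - prefixes[sj][soff]  # suffix of start partition
                    + sum(totals[sj + 1:j])                # whole partitions in between
                    + prefixes[j][offset + 1]              # prefix of current partition
                )
            result.append(total / (i - start + 1))
    return result


if __name__ == "__main__":
    print(sliding_averages([1, 2, 3, 4, 5, 6, 7], window=3, num_parts=3))
```

In PySpark, the per-partition step could plausibly be expressed with `RDD.mapPartitionsWithIndex`, with the small list of partition totals shared across partitions; that mapping is an assumption of this sketch, not a description of the repository's code.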