## Case Study 4: Range Join Conditions [TODO]

> A naive approach (just specifying this as the range condition) would result in a full cartesian product and a filter that enforces the condition (tested using Spark 2.0). This has a horrible effect on performance, especially if DataFrames are more than a few hundred thousands records.

source: http://zachmoshe.com/2016/09/26/efficient-range-joins-with-spark.html

> The source of the problem is pretty simple. When you execute join and join condition is not equality based the only thing that Spark can do right now is expand it to Cartesian product followed by filter what is pretty much what happens inside `BroadcastNestedLoopJoin`

source: https://stackoverflow.com/questions/37953830/spark-sql-performance-join-on-value-between-min-and-max?answertab=active#tab-top
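
To make the "Cartesian product followed by filter" behaviour from the quotes above concrete, here is a minimal pure-Python sketch. It is not Spark's implementation of `BroadcastNestedLoopJoin`. The function name `nested_loop_range_join` and the toy rows are assumptions chosen for illustration. The sketch only shows that every record is compared against every range, so the work grows with the product of the two table sizes.

```python
# Toy data (illustrative values): ranges are (ipstart, ipend, loc), records are (id, inet).
ranges = [(1, 10, "foo"), (11, 36, "bar"), (37, 59, "baz")]
records = [(1, 11), (2, 38), (3, 50)]


def nested_loop_range_join(records, ranges):
    """Left outer range join done naively: test every (record, range) pair."""
    comparisons = 0
    joined = []
    for rec_id, inet in records:
        matched = False
        for ipstart, ipend, loc in ranges:
            comparisons += 1  # one range-condition check per pair
            if ipstart <= inet <= ipend:
                joined.append((rec_id, inet, ipstart, ipend, loc))
                matched = True
        if not matched:
            # Left outer semantics: keep the record with empty range columns.
            joined.append((rec_id, inet, None, None, None))
    return joined, comparisons


joined, comparisons = nested_loop_range_join(records, ranges)
print(joined)
print(comparisons)
```

With 3 records and 3 ranges the loop performs 3 × 3 = 9 comparisons. The same pattern on two tables of a few hundred thousand rows each means tens of billions of checks, which is the performance problem the quotes describe.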

### Library Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Create a `SparkSession`. There is no need to create a `SparkContext` separately, because the session exposes one as `spark.sparkContext`.

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Exploring Joins") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

### Initial Dataset

In [3]:
geo_loc_table = spark.createDataFrame([
    (1, 10, "foo"), 
    (11, 36, "bar"), 
    (37, 59, "baz"),
], ["ipstart", "ipend", "loc"])

geo_loc_table.toPandas()

|   | ipstart | ipend | loc |
|---|---------|-------|-----|
| 0 | 1       | 10    | foo |
| 1 | 11      | 36    | bar |
| 2 | 37      | 59    | baz |


In [4]:
records_table = spark.createDataFrame([
    (1, 11), 
    (2, 38), 
    (3, 50),
],["id", "inet"])

records_table.toPandas()

|   | id | inet |
|---|----|------|
| 0 | 1  | 11   |
| 1 | 2  | 38   |
| 2 | 3  | 50   |


### Option #1: Join directly on the range condition

In [5]:
join_condition = [
    records_table['inet'] >= geo_loc_table['ipstart'],
    records_table['inet'] <= geo_loc_table['ipend'],
]

df = records_table.join(geo_loc_table, join_condition, "left")

df.toPandas()

|   | id | inet | ipstart | ipend | loc |
|---|----|------|---------|-------|-----|
| 0 | 1  | 11   | 11      | 36    | bar |
| 1 | 2  | 38   | 37      | 59    | baz |
| 2 | 3  | 50   | 37      | 59    | baz |


In [6]:
df.explain()

== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftOuter, ((inet#7L >= ipstart#0L) && (inet#7L <= ipend#1L))
:- Scan ExistingRDD[id#6L,inet#7L]
+- BroadcastExchange IdentityBroadcastMode
   +- Scan ExistingRDD[ipstart#0L,ipend#1L,loc#2]


### Option #2: Look up each record's range start, then equi-join

In [7]:
from bisect import bisect_right
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

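# Collect the sorted range starts to the driver and broadcast them as a list.
# A list (not a lazy `map` object) is needed so bisect can index into it.
geo_start_bd = spark.sparkContext.broadcast([
    row.ipstart
    for row in geo_loc_table.select("ipstart").orderBy("ipstart").collect()
])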

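def find_le(x):
    """Return the largest broadcast ipstart that is <= x, or None if there is none."""
    # Note: only the lower bound is checked; ipend is never consulted here.
    i = bisect_right(geo_start_bd.value, x)
    if i:
        return geo_start_bd.value[i - 1]
    return None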

records_table_with_ipstart = records_table.withColumn(
    "ipstart", udf(find_le, LongType())("inet")
)

df = records_table_with_ipstart.join(geo_loc_table, ["ipstart"], "left")

df.toPandas()

|   | ipstart | id | inet | ipend | loc |
|---|---------|----|------|-------|-----|
| 0 | 37      | 2  | 38   | 59    | baz |
| 1 | 37      | 3  | 50   | 59    | baz |
| 2 | 11      | 1  | 11   | 36    | bar |
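
The lookup in Option #2 matches on `ipstart` alone, so the result above is correct only because the sample ranges are contiguous and non-overlapping. An `inet` of 60, for example, would still be mapped to `ipstart` 37 and joined to `baz`, even though 60 is greater than that row's `ipend` of 59. The following pure-Python sketch (no Spark involved) shows one way the upper bound could be checked as well. The name `lookup` and the module-level lists are hypothetical, and the sketch assumes the ranges do not overlap.

```python
from bisect import bisect_right

# Same toy ranges as the case study: (ipstart, ipend, loc), sorted by ipstart.
ranges = sorted([(1, 10, "foo"), (11, 36, "bar"), (37, 59, "baz")])
starts = [r[0] for r in ranges]


def lookup(inet):
    """Return the (ipstart, ipend, loc) range containing inet, or None.

    Assumes ranges are non-overlapping, so at most one range can match.
    """
    i = bisect_right(starts, inet)
    if i == 0:
        return None  # inet is below the smallest ipstart
    candidate = ranges[i - 1]
    # Check the upper bound too, so values in a gap or past the end are rejected.
    return candidate if inet <= candidate[1] else None


print(lookup(11))
print(lookup(60))
```

Each lookup is a binary search over the sorted starts, so it costs on the order of log(number of ranges) comparisons per record instead of one comparison per range.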


In [8]:
df.explain()

== Physical Plan ==
*(4) Project [ipstart#27L, id#6L, inet#7L, ipend#1L, loc#2]
+- SortMergeJoin [ipstart#27L], [ipstart#0L], LeftOuter
   :- *(2) Sort [ipstart#27L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(ipstart#27L, 200)
   :     +- *(1) Project [id#6L, inet#7L, pythonUDF0#36L AS ipstart#27L]
   :        +- BatchEvalPython [find_le(inet#7L)], [id#6L, inet#7L, pythonUDF0#36L]
   :           +- Scan ExistingRDD[id#6L,inet#7L]
   +- *(3) Sort [ipstart#0L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(ipstart#0L, 200)
         +- Scan ExistingRDD[ipstart#0L,ipend#1L,loc#2]
