# Lesson 20 - Filtering Joins and Cross Joins

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Introduction

In this lesson, we will discuss the last three types of joins: **semi joins**, **anti joins**, and **cross joins**. 

To demonstrate these joins, we will recreate the `LDF` and `RDF` DataFrames from the previous lesson.

In [0]:
left_schema = 'id INTEGER, c1 STRING, c2 DOUBLE'

left_list = [
    [101, 'C', 3.6],
    [101, 'B', 1.7],
    [103, 'A', 2.8],
    [104, 'A', 4.7],
    [105, 'B', 3.9],
    [106, 'C', 4.2]
]

LDF = spark.createDataFrame(left_list, schema=left_schema)

In [0]:
LDF.show()

In [0]:
right_schema = 'id INTEGER, d1 STRING, d2 INTEGER'

right_list = [
    [101, 'A', 17],
    [102, 'C', 24],
    [102, 'A', 32],
    [104, 'B', 16],
    [104, 'B', 19],
    [105, 'A', 25]
]

RDF = spark.createDataFrame(right_list, schema=right_schema)

In [0]:
RDF.show()

## Filtering Joins

A **filtering join** is an asymmetric join that returns a filtered version of the left DataFrame. The columns in the DataFrame resulting from a filtering join are identical to those from the left DataFrame, while the rows contained in the result are a subset of those in the left DataFrame. There are two types of filtering joins: **semi joins** and **anti joins**.

* A **semi join** returns only those rows from the left DataFrame that are key-matched with rows from the right DataFrame.
* An **anti join** returns only those rows from the left DataFrame that are not key-matched with any rows from the right DataFrame.
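
The two definitions above can be sketched in plain Python, without Spark, to make the filtering logic concrete. This is only an illustration of the semantics, not a description of how Spark executes these joins. The helper names `semi_join` and `anti_join` are hypothetical, and the rows are copied from `LDF` and `RDF`.

```python
# Rows copied from LDF and RDF, stored as plain tuples (id is the first field).
left_rows = [
    (101, 'C', 3.6), (101, 'B', 1.7), (103, 'A', 2.8),
    (104, 'A', 4.7), (105, 'B', 3.9), (106, 'C', 4.2),
]
right_rows = [
    (101, 'A', 17), (102, 'C', 24), (102, 'A', 32),
    (104, 'B', 16), (104, 'B', 19), (105, 'A', 25),
]

def semi_join(left, right):
    """Hypothetical helper: keep left rows whose id appears in right."""
    right_keys = {row[0] for row in right}
    return [row for row in left if row[0] in right_keys]

def anti_join(left, right):
    """Hypothetical helper: keep left rows whose id does not appear in right."""
    right_keys = {row[0] for row in right}
    return [row for row in left if row[0] not in right_keys]

semi_rows = semi_join(left_rows, right_rows)
anti_rows = anti_join(left_rows, right_rows)

print(semi_rows)
print(anti_rows)
```

Because every left row is either matched or unmatched, the semi join and the anti join together partition the left rows: each row of the left side lands in exactly one of the two results.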

### Semi Joins

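We will begin by demonstrating the semi join, which is requested by passing `how='semi'` to the `join()` method. The values `'leftsemi'` and `'left_semi'` are accepted as equivalent spellings.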

In [0]:
LDF.join(other=RDF, on='id', how='semi').show()

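For convenience, the contents of `LDF`, `RDF`, and the semi join are all provided below. Notice that `id` 104 matches two rows in `RDF`, yet the corresponding row of `LDF` appears only once in the result. The order of the rows in the output is not guaranteed.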

    LDF              RDF              Semi Join
    +---+---+---+    +---+---+---+    +---+---+---+
    | id| c1| c2|    | id| d1| d2|    | id| c1| c2|
    +---+---+---+    +---+---+---+    +---+---+---+
    |101|  C|3.6|    |101|  A| 17|    |101|  B|1.7|
    |101|  B|1.7|    |102|  C| 24|    |101|  C|3.6|
    |103|  A|2.8|    |102|  A| 32|    |104|  A|4.7|
    |104|  A|4.7|    |104|  B| 16|    |105|  B|3.9|
    |105|  B|3.9|    |104|  B| 19|    +---+---+---+
    |106|  C|4.2|    |105|  A| 25|    
    +---+---+---+    +---+---+---+
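
A semi join differs from an inner join in that it never duplicates left rows. The sketch below uses plain Python, not Spark, and looks only at the `id` values from `LDF` and `RDF`. The variable names are chosen for illustration.

```python
# The id columns of LDF and RDF, copied from the tables above.
left_ids = [101, 101, 103, 104, 105, 106]
right_ids = [101, 102, 102, 104, 104, 105]

# Inner join: one output row per matching (left, right) pair.
inner_ids = [l for l in left_ids for r in right_ids if l == r]

# Semi join: one output row per left row that has at least one match.
right_key_set = set(right_ids)
semi_ids = [l for l in left_ids if l in right_key_set]

print(len(inner_ids), inner_ids)
print(len(semi_ids), semi_ids)
```

Here the inner join yields five rows, because `id` 104 matches two rows on the right, while the semi join yields four.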

### Anti Joins

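We will now demonstrate the anti join, which is requested by passing `how='anti'` to the `join()` method. The values `'leftanti'` and `'left_anti'` are accepted as equivalent spellings.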

In [0]:
LDF.join(other=RDF, on='id', how='anti').show()

For convenience, the contents of `LDF`, `RDF`, and the anti join are all provided below. 

    LDF              RDF              Anti Join
    +---+---+---+    +---+---+---+    +---+---+---+
    | id| c1| c2|    | id| d1| d2|    | id| c1| c2|
    +---+---+---+    +---+---+---+    +---+---+---+
    |101|  C|3.6|    |101|  A| 17|    |103|  A|2.8|
    |101|  B|1.7|    |102|  C| 24|    |106|  C|4.2|
    |103|  A|2.8|    |102|  A| 32|    +---+---+---+
    |104|  A|4.7|    |104|  B| 16|    
    |105|  B|3.9|    |104|  B| 19|    
    |106|  C|4.2|    |105|  A| 25|    
    +---+---+---+    +---+---+---+

## Cross Joins

Unlike the other join operations, the **cross join** does not require a key column. It behaves a bit like an inner join in which every row in the left DataFrame is treated as matching every row in the right DataFrame. In other words, the DataFrame returned by a cross join contains one row for every possible pair of rows formed by selecting one row from the left DataFrame and one row from the right. This means that if the left DataFrame contains `N` rows and the right DataFrame contains `M` rows, then the cross-joined DataFrame will contain `N × M` rows. The resulting DataFrame contains every column from the left DataFrame followed by every column from the right DataFrame.

To illustrate the cross join operation, we will create two very small DataFrames.

In [0]:
X = spark.createDataFrame(
    data = [['A', 1, 10], ['A', 2, 20], ['B', 2, 30]],
    schema = 'c1 STRING, c2 INTEGER, c3 INTEGER'
)

Y = spark.createDataFrame(
    data = [['P', 'cat'], ['Q', 'dog']],
    schema = 'd1 STRING, d2 STRING'
)

X.show()
Y.show()

In [0]:
X.crossJoin(Y).show()
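
The shape of this result can be checked with a plain-Python sketch of the same pairing logic. It uses `itertools.product` as a stand-in for the cross join and is not how Spark computes it. The row lists are copied from `X` and `Y`, and the variable names are chosen for illustration.

```python
from itertools import product

# Rows copied from X (3 rows, 3 columns) and Y (2 rows, 2 columns).
x_rows = [('A', 1, 10), ('A', 2, 20), ('B', 2, 30)]
y_rows = [('P', 'cat'), ('Q', 'dog')]

# Pair every row of X with every row of Y, then concatenate the columns.
cross_rows = [x + y for x, y in product(x_rows, y_rows)]

for row in cross_rows:
    print(row)

print(len(cross_rows))
```

With `N = 3` rows in `X` and `M = 2` rows in `Y`, the sketch produces `3 × 2 = 6` rows, each with the three columns of `X` followed by the two columns of `Y`.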