# Lesson 19 - Inner and Outer Joins

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Introduction

Up to this point, our examples have involved data that could be stored in a single DataFrame. In practice, we will often need to work with multiple DataFrames at the same time, with each DataFrame containing different aspects of our data. In this lesson we will discuss **joins**, which are transformations that allow us to create new DataFrames by combining information from multiple DataFrames.

To demonstrate the different types of joins, we will create two simple DataFrames `LDF` and `RDF`.

In [0]:
left_schema = 'id INTEGER, c1 STRING, c2 DOUBLE'

left_list = [
    [101, 'C', 3.6],
    [101, 'B', 1.7],
    [103, 'A', 2.8],
    [104, 'A', 4.7],
    [105, 'B', 3.9],
    [106, 'C', 4.2]
]

LDF = spark.createDataFrame(left_list, schema=left_schema)

In [0]:
LDF.show()

In [0]:
right_schema = 'id INTEGER, d1 STRING, d2 INTEGER'

right_list = [
    [101, 'A', 17],
    [102, 'C', 24],
    [102, 'A', 32],
    [104, 'B', 16],
    [104, 'B', 19],
    [105, 'A', 25]
]

RDF = spark.createDataFrame(right_list, schema=right_schema)

In [0]:
RDF.show()

## Join Syntax

There are seven types of joins that we will discuss in this course: **inner joins**, **outer joins**, **left outer joins**, **right outer joins**, **left semi joins**, **left anti joins**, and **cross joins**. With the exception of cross joins, each of these joins is performed using the `join()` transformation. This transformation has three parameters:
* **`other`** is used to specify the DataFrame to be joined with the DataFrame that `join()` was called on. 
* **`on`** is used to specify the columns on which the data is to be joined. The argument provided to `on` should be either the name of a key column that exists in both DataFrames or a list of such names.
* **`how`** is used to specify the type of join desired. The table below shows the string values that can be used to perform each type of join.



| Join Type        | Argument for `how` Parameter                          |
|------------------|-------------------------------------------------------|
| Inner Join       | `'inner'`                                             |
| Outer Join       | `'outer'`, `'full'`, `'fullouter'`, or `'full_outer'` |
| Left Outer Join  | `'left'`, `'leftouter'`, or `'left_outer'`            |
| Right Outer Join | `'right'`, `'rightouter'`, or `'right_outer'`         |
| Semi Join        | `'semi'`, `'leftsemi'`, or `'left_semi'`              |
| Anti Join        | `'anti'`, `'leftanti'`, or `'left_anti'`              |

## Inner Joins

When an **inner join** is performed on two DataFrames, the result is a new DataFrame that contains one row for each pair of key-matched rows from the original DataFrames. The joined DataFrame will contain one column for the key being joined on as well as one column for each non-key column in either of the original DataFrames.

- The inner join is symmetric, so `RDF.join(other=LDF, on='id', how='inner').show()` would display the same rows. Only the column order changes: the non-key columns of `RDF` (`d1`, `d2`) would appear before those of `LDF` (`c1`, `c2`).

In [0]:
LDF.join(other=RDF, on='id', how='inner').show()

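For convenience, the contents of `LDF`, `RDF`, and the inner join are all provided below. Spark does not guarantee the order of rows in a joined DataFrame, so your output may list the same rows in a different order.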

    LDF              RDF              Inner Join
    +---+---+---+    +---+---+---+    +---+---+---+---+---+
    | id| c1| c2|    | id| d1| d2|    | id| c1| c2| d1| d2|
    +---+---+---+    +---+---+---+    +---+---+---+---+---+
    |101|  C|3.6|    |101|  A| 17|    |101|  C|3.6|  A| 17|
    |101|  B|1.7|    |102|  C| 24|    |101|  B|1.7|  A| 17|
    |103|  A|2.8|    |102|  A| 32|    |104|  A|4.7|  B| 16|
    |104|  A|4.7|    |104|  B| 16|    |104|  A|4.7|  B| 19|
    |105|  B|3.9|    |104|  B| 19|    |105|  B|3.9|  A| 25|
    |106|  C|4.2|    |105|  A| 25|    +---+---+---+---+---+
    +---+---+---+    +---+---+---+
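The rule "one row for each pair of key-matched rows" can be checked without Spark. The following is a minimal plain-Python sketch of what an inner join computes, not of how Spark implements it. The names `left_rows`, `right_rows`, and `inner_join` are hypothetical helpers introduced only for this illustration; the data is the same as in `LDF` and `RDF`.

```python
# Same rows as LDF (id, c1, c2) and RDF (id, d1, d2), stored as plain tuples.
left_rows = [
    (101, 'C', 3.6), (101, 'B', 1.7), (103, 'A', 2.8),
    (104, 'A', 4.7), (105, 'B', 3.9), (106, 'C', 4.2),
]
right_rows = [
    (101, 'A', 17), (102, 'C', 24), (102, 'A', 32),
    (104, 'B', 16), (104, 'B', 19), (105, 'A', 25),
]

def inner_join(left, right):
    """Hypothetical helper: pair each left row with every right row sharing its id."""
    return [
        (l_id, c1, c2, d1, d2)
        for (l_id, c1, c2) in left
        for (r_id, d1, d2) in right
        if l_id == r_id
    ]

inner = inner_join(left_rows, right_rows)
for row in inner:
    print(row)
```

The sketch yields five rows. Keys 103 and 106 (left only) and 102 (right only) drop out because they have no partner, while key 104 appears twice because its single left row matches two right rows.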

## Outer Joins

When an **outer join** is performed on two DataFrames, the resulting DataFrame contains every row that would be included in an inner join, plus one row for every row in either of the original DataFrames that was not key-matched with a row from the other DataFrame. The columns contained in an outer join are the same as those in an inner join. 

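- The outer join is also symmetric: swapping `LDF` and `RDF` produces the same rows, with only the order of the non-key columns changing.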

In [0]:
LDF.join(other=RDF, on='id', how='outer').show()

For convenience, the contents of `LDF`, `RDF`, and the outer join are all provided below. 

    LDF              RDF              Outer Join
    +---+---+---+    +---+---+---+    +---+----+----+----+----+
    | id| c1| c2|    | id| d1| d2|    | id|  c1|  c2|  d1|  d2|
    +---+---+---+    +---+---+---+    +---+----+----+----+----+
    |101|  C|3.6|    |101|  A| 17|    |101|   B| 1.7|   A|  17|
    |101|  B|1.7|    |102|  C| 24|    |101|   C| 3.6|   A|  17|
    |103|  A|2.8|    |102|  A| 32|    |102|null|null|   C|  24|
    |104|  A|4.7|    |104|  B| 16|    |102|null|null|   A|  32|
    |105|  B|3.9|    |104|  B| 19|    |103|   A| 2.8|null|null|
    |106|  C|4.2|    |105|  A| 25|    |104|   A| 4.7|   B|  16|
    +---+---+---+    +---+---+---+    |104|   A| 4.7|   B|  19|
                                      |105|   B| 3.9|   A|  25|
                                      |106|   C| 4.2|null|null|
                                      +---+----+----+----+----+
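As a rough illustration of where the `null` entries come from, here is a plain-Python sketch of the outer join logic. It is a model of the result only, under the assumption that `None` stands in for Spark's `null`; the function `outer_join` and the variable names are hypothetical and not part of PySpark.

```python
# Same rows as LDF (id, c1, c2) and RDF (id, d1, d2), stored as plain tuples.
left_rows = [
    (101, 'C', 3.6), (101, 'B', 1.7), (103, 'A', 2.8),
    (104, 'A', 4.7), (105, 'B', 3.9), (106, 'C', 4.2),
]
right_rows = [
    (101, 'A', 17), (102, 'C', 24), (102, 'A', 32),
    (104, 'B', 16), (104, 'B', 19), (105, 'A', 25),
]

def outer_join(left, right):
    """Hypothetical helper: full outer join on id, with None playing the role of null."""
    joined = []
    matched_right = set()  # positions of right rows that found a partner
    for (l_id, c1, c2) in left:
        found = False
        for pos, (r_id, d1, d2) in enumerate(right):
            if l_id == r_id:
                joined.append((l_id, c1, c2, d1, d2))
                matched_right.add(pos)
                found = True
        if not found:
            # Unmatched left row: pad the right-hand columns.
            joined.append((l_id, c1, c2, None, None))
    for pos, (r_id, d1, d2) in enumerate(right):
        if pos not in matched_right:
            # Unmatched right row: pad the left-hand columns.
            joined.append((r_id, None, None, d1, d2))
    return joined

# Sort by id only to make the printout easier to compare with the table above.
outer = sorted(outer_join(left_rows, right_rows), key=lambda row: row[0])
for row in outer:
    print(row)
```

The nine rows are the five inner-join rows, two padded rows for the unmatched left keys 103 and 106, and two padded rows for the two unmatched right rows with key 102.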

## Left Outer Joins

A **left outer join** contains all of the rows and columns included in an inner join, as well as one row for every row in the left DataFrame that has no match in the right DataFrame.

In [0]:
LDF.join(other=RDF, on='id', how='left').show()

For convenience, the contents of `LDF`, `RDF`, and the left outer join are all provided below. 
                                     
    LDF              RDF              Left Outer Join    
    +---+---+---+    +---+---+---+    +---+----+----+----+----+
    | id| c1| c2|    | id| d1| d2|    | id|  c1|  c2|  d1|  d2|
    +---+---+---+    +---+---+---+    +---+----+----+----+----+
    |101|  C|3.6|    |101|  A| 17|    |101|   B| 1.7|   A|  17|
    |101|  B|1.7|    |102|  C| 24|    |101|   C| 3.6|   A|  17|
    |103|  A|2.8|    |102|  A| 32|    |103|   A| 2.8|null|null|
    |104|  A|4.7|    |104|  B| 16|    |104|   A| 4.7|   B|  16|
    |105|  B|3.9|    |104|  B| 19|    |104|   A| 4.7|   B|  19|
    |106|  C|4.2|    |105|  A| 25|    |105|   B| 3.9|   A|  25|
    +---+---+---+    +---+---+---+    |106|   C| 4.2|null|null|                                     
                                      +---+----+----+----+----+
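The left outer join can be modeled the same way. The sketch below is plain Python rather than PySpark and groups the right-hand rows by key first, which is only one possible way to organize the lookup. The names `left_join` and `right_by_id` are hypothetical, and `None` is assumed to stand in for `null`.

```python
from collections import defaultdict

# Same rows as LDF (id, c1, c2) and RDF (id, d1, d2), stored as plain tuples.
left_rows = [
    (101, 'C', 3.6), (101, 'B', 1.7), (103, 'A', 2.8),
    (104, 'A', 4.7), (105, 'B', 3.9), (106, 'C', 4.2),
]
right_rows = [
    (101, 'A', 17), (102, 'C', 24), (102, 'A', 32),
    (104, 'B', 16), (104, 'B', 19), (105, 'A', 25),
]

def left_join(left, right):
    """Hypothetical helper: keep every left row, padding with None when unmatched."""
    right_by_id = defaultdict(list)
    for (r_id, d1, d2) in right:
        right_by_id[r_id].append((d1, d2))

    joined = []
    for (l_id, c1, c2) in left:
        # A left row without a match gets a single (None, None) partner.
        partners = right_by_id.get(l_id, [(None, None)])
        for (d1, d2) in partners:
            joined.append((l_id, c1, c2, d1, d2))
    return joined

left_result = left_join(left_rows, right_rows)
for row in left_result:
    print(row)
```

Every left row survives at least once, so the result has seven rows, and key 102 never appears because it exists only in the right-hand data.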

## Right Outer Joins

A **right outer join** contains all of the rows and columns included in an inner join, as well as one row for every row in the right DataFrame that has no match in the left DataFrame.

In [0]:
LDF.join(other=RDF, on='id', how='right').show()

For convenience, the contents of `LDF`, `RDF`, and the right outer join are all provided below. 
                                      
    LDF              RDF              Right Outer Join
    +---+---+---+    +---+---+---+    +---+----+----+----+----+
    | id| c1| c2|    | id| d1| d2|    | id|  c1|  c2|  d1|  d2|
    +---+---+---+    +---+---+---+    +---+----+----+----+----+
    |101|  C|3.6|    |101|  A| 17|    |101|   B| 1.7|   A|  17|
    |101|  B|1.7|    |102|  C| 24|    |101|   C| 3.6|   A|  17|
    |103|  A|2.8|    |102|  A| 32|    |102|null|null|   C|  24|
    |104|  A|4.7|    |104|  B| 16|    |102|null|null|   A|  32|
    |105|  B|3.9|    |104|  B| 19|    |104|   A| 4.7|   B|  16|
    |106|  C|4.2|    |105|  A| 25|    |104|   A| 4.7|   B|  19|
    +---+---+---+    +---+---+---+    |105|   B| 3.9|   A|  25|
                                      +---+----+----+----+----+
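One way to sanity-check the four result tables in this lesson is to predict their row counts from the key columns alone. The plain-Python sketch below assumes only the `id` values of `LDF` and `RDF`; the variable names are hypothetical and nothing here calls PySpark.

```python
from collections import Counter

left_ids = [101, 101, 103, 104, 105, 106]   # id column of LDF
right_ids = [101, 102, 102, 104, 104, 105]  # id column of RDF

left_n = Counter(left_ids)
right_n = Counter(right_ids)

# A key with m left rows and n right rows yields m * n matched rows.
inner_count = sum(left_n[k] * right_n[k] for k in left_n.keys() & right_n.keys())

# Each unmatched row contributes exactly one null-padded row.
left_only = sum(n for k, n in left_n.items() if k not in right_n)
right_only = sum(n for k, n in right_n.items() if k not in left_n)

counts = {
    'inner': inner_count,
    'left': inner_count + left_only,
    'right': inner_count + right_only,
    'outer': inner_count + left_only + right_only,
}
print(counts)  # {'inner': 5, 'left': 7, 'right': 7, 'outer': 9}
```

These counts agree with the tables shown above: the right outer join adds the two unmatched `102` rows to the five inner-join rows, just as the left outer join adds the rows for `103` and `106`.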