# Lesson 21 - Additional Join Topics

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Joining on Two Columns 

It is possible to join two DataFrames using two or more key columns. In such a situation, two rows from different DataFrames are considered to be key-matched if and only if they share the same value in each of the key columns. To illustrate this concept, we will create two DataFrames named `LDF` and `RDF`.

In [0]:
LDF = spark.createDataFrame(
    data = [['A', 1, 10], ['A', 2, 20], ['B', 2, 30]],
    schema = 'c1 STRING, c2 INTEGER, c3 INTEGER'
)

RDF = spark.createDataFrame(
    data = [['A', 1, 40], ['A', 1, 50], ['B', 1, 60], ['B', 2, 70]],
    schema = 'c1 STRING, c2 INTEGER, c4 INTEGER'
)

LDF.show()
RDF.show()

In [0]:
LDF.join(other=RDF, on=['c1','c2'], how='inner').show()

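Only rows whose `c1` and `c2` values both agree are paired. The `LDF` row with key `('A', 2)` has no counterpart in `RDF`, and the `RDF` row with key `('B', 1)` has no counterpart in `LDF`, so neither appears in the result. For convenience, the contents of `LDF`, `RDF`, and the inner join are all provided below.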
    
    LDF              RDF              Inner Join
    +---+---+---+    +---+---+---+    +---+---+---+---+
    | c1| c2| c3|    | c1| c2| c4|    | c1| c2| c3| c4|
    +---+---+---+    +---+---+---+    +---+---+---+---+
    |  A|  1| 10|    |  A|  1| 40|    |  A|  1| 10| 40|
    |  A|  2| 20|    |  A|  1| 50|    |  A|  1| 10| 50|
    |  B|  2| 30|    |  B|  1| 60|    |  B|  2| 30| 70|
    +---+---+---+    |  B|  2| 70|    +---+---+---+---+
                     +---+---+---+
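
The key-matching rule described above can also be checked without Spark. The following is a minimal plain-Python sketch, not a description of how Spark executes the join: the rows are copied from the tables above, and the names `ldf_rows`, `rdf_rows`, and `inner` are illustrative only.

```python
# Rows of LDF (c1, c2, c3) and RDF (c1, c2, c4), copied from the tables above.
ldf_rows = [('A', 1, 10), ('A', 2, 20), ('B', 2, 30)]
rdf_rows = [('A', 1, 40), ('A', 1, 50), ('B', 1, 60), ('B', 2, 70)]

# Two rows are key-matched only if BOTH c1 and c2 agree.
inner = [
    (l_c1, l_c2, c3, c4)
    for (l_c1, l_c2, c3) in ldf_rows
    for (r_c1, r_c2, c4) in rdf_rows
    if (l_c1, l_c2) == (r_c1, r_c2)
]

for row in inner:
    print(row)
```

The three tuples printed correspond to the three rows of the inner join shown above.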

## Joining on Differently-Named Key Columns

Sometimes we need to join two DataFrames whose key columns have different names. A fairly straightforward way to address this is to rename the key column in one of the two DataFrames. If, for whatever reason, that is undesirable, we can still perform the join by passing to the `on` parameter an expression that performs an equality comparison between the two key columns.

In [0]:
X = spark.createDataFrame(
  data = [[1, 'A'], [2, 'B'], [2, 'C'], [3, 'D']],
  schema = 'c1 INT, c2 STRING'
)

Y = spark.createDataFrame(
  data = [[1, 'E'], [1, 'F'], [3, 'G'], [3, 'H']],
  schema = 'd1 INT, d2 STRING'
)

X.show()
Y.show()

In [0]:
X.join(other=Y, on=(X.c1 == Y.d1), how='inner').show()

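Because the join condition is an expression rather than a column name, both key columns, `c1` and `d1`, are kept in the result. The rows of `X` with `c1` equal to 2 have no match in `Y` and are dropped. For convenience, the contents of `X`, `Y`, and the inner join are all provided below.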
    
    X            Y            Inner Join
    +---+---+    +---+---+    +---+---+---+---+
    | c1| c2|    | d1| d2|    | c1| c2| d1| d2|
    +---+---+    +---+---+    +---+---+---+---+
    |  1|  A|    |  1|  E|    |  1|  A|  1|  E|
    |  2|  B|    |  1|  F|    |  1|  A|  1|  F|
    |  2|  C|    |  3|  G|    |  3|  D|  3|  G|
    |  3|  D|    |  3|  H|    |  3|  D|  3|  H|
    +---+---+    +---+---+    +---+---+---+---+