# PySpark Joins

One of the most common ways to combine two DataFrames is to join them on a shared column. The `join` method supports several join types, selected with its `how` argument.

### Types of joins

There are several ways to join DataFrames, and each one changes which rows and columns end up in the result.

In [None]:
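# imports
# Note: `spark` (the SparkSession) and `display` are assumed to be predefined, as in a Databricks notebook
from datetime import datetime, date

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Row
from pyspark.sql.functions import concat, lit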

### Creating the DataFrames to Work With

In [None]:
names_data = [
    ("Luke Skywalker", "Rebel"),
    ("Darth Vader", "Empire"),
    ("Boba Fett", "Bounty Hunter")
]

names = spark.createDataFrame(names_data, ["name", "faction"])

names.show()

In [None]:
factions_data = [
    ("Jawas", "Neutral Evil"),
    ("Rebel", "Chaotic Good"),
    ("Empire", "Lawful Evil")
]

factions = spark.createDataFrame(factions_data, ["faction", "alignment"])

factions.show()

### Left Join / Left Outer

A left join keeps every row of the left DataFrame and attaches the matching columns from the right DataFrame, based on the join column. Left rows with no match on the right get null in the right-hand columns. Right rows with no match on the left are dropped.

In [None]:
names.show()
factions.show()

left_join = names.join(factions, on='faction', how='left')
left_join.show()
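
To make the matching rule concrete, here is a minimal plain-Python sketch of what the left join above computes. It models the semantics only and is not Spark code: `left_join_rows` is a hypothetical helper, and the dictionary lookup assumes each faction appears at most once on the right.

```python
# Plain-Python model of a left join on "faction" (illustrative, not Spark code).
names = [
    ("Luke Skywalker", "Rebel"),
    ("Darth Vader", "Empire"),
    ("Boba Fett", "Bounty Hunter"),
]
# Assumption: right-side keys are unique, so a dict lookup is enough.
factions = {
    "Jawas": "Neutral Evil",
    "Rebel": "Chaotic Good",
    "Empire": "Lawful Evil",
}


def left_join_rows(left_rows, right_lookup):
    """Keep every left row; use None where the key has no match on the right."""
    return [
        (faction, name, right_lookup.get(faction))
        for name, faction in left_rows
    ]


left_rows = left_join_rows(names, factions)
for row in left_rows:
    print(row)
# ('Rebel', 'Luke Skywalker', 'Chaotic Good')
# ('Empire', 'Darth Vader', 'Lawful Evil')
# ('Bounty Hunter', 'Boba Fett', None)
```

Boba Fett's row survives with `None` for `alignment`, and the `Jawas` faction never appears, which mirrors the nulls and dropped rows in `left_join.show()`.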

### Right / Right Outer

A right join keeps every row of the right DataFrame and attaches the matching columns from the left DataFrame, based on the join column. Right rows with no match on the left get null in the left-hand columns. Left rows with no match on the right are dropped.

In [None]:
names.show()
factions.show()

right_join = names.join(factions, on='faction', how='right')
right_join.show()
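
The same idea can be sketched for the right join in plain Python. This is a model of the semantics rather than Spark code; `right_join_rows` is a hypothetical helper, and the row order shown is simply the order of the right-hand list (Spark itself makes no ordering promise).

```python
# Plain-Python model of a right join on "faction" (illustrative, not Spark code).
names = [
    ("Luke Skywalker", "Rebel"),
    ("Darth Vader", "Empire"),
    ("Boba Fett", "Bounty Hunter"),
]
factions = [
    ("Jawas", "Neutral Evil"),
    ("Rebel", "Chaotic Good"),
    ("Empire", "Lawful Evil"),
]


def right_join_rows(left_rows, right_rows):
    """Keep every right row; pair it with each matching left row, or with None."""
    out = []
    for faction, alignment in right_rows:
        matches = [name for name, f in left_rows if f == faction]
        # An empty match list still yields one output row, with name set to None.
        for name in matches or [None]:
            out.append((faction, name, alignment))
    return out


right_rows = right_join_rows(names, factions)
for row in right_rows:
    print(row)
# ('Jawas', None, 'Neutral Evil')
# ('Rebel', 'Luke Skywalker', 'Chaotic Good')
# ('Empire', 'Darth Vader', 'Lawful Evil')
```

Here the roles are reversed: `Jawas` is kept with `None` for `name`, and Boba Fett's `Bounty Hunter` row is dropped.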

### Outer / Full Outer / Full
Keeps every row from both DataFrames, filling in null wherever a row has no match on the other side.

In [None]:
names.show()
factions.show()

outer_join = names.join(factions, on='faction', how='outer')
outer_join.show()
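
A full outer join can be modeled in plain Python as a walk over the union of the keys from both sides. The sketch below is illustrative only: `full_outer_rows` is a hypothetical helper, it assumes each faction appears at most once per side, and it sorts the keys just to make the printed order predictable.

```python
# Plain-Python model of a full outer join on "faction" (illustrative, not Spark code).
names = [
    ("Luke Skywalker", "Rebel"),
    ("Darth Vader", "Empire"),
    ("Boba Fett", "Bounty Hunter"),
]
factions = [
    ("Jawas", "Neutral Evil"),
    ("Rebel", "Chaotic Good"),
    ("Empire", "Lawful Evil"),
]


def full_outer_rows(left_rows, right_rows):
    """Emit one row per key found on either side, with None for the missing side."""
    left = {faction: name for name, faction in left_rows}  # assumes unique keys
    right = dict(right_rows)                               # assumes unique keys
    keys = sorted(set(left) | set(right))
    return [(key, left.get(key), right.get(key)) for key in keys]


outer_rows = full_outer_rows(names, factions)
for row in outer_rows:
    print(row)
# ('Bounty Hunter', 'Boba Fett', None)
# ('Empire', 'Darth Vader', 'Lawful Evil')
# ('Jawas', None, 'Neutral Evil')
# ('Rebel', 'Luke Skywalker', 'Chaotic Good')
```

The result has four rows: the two matched factions plus one unmatched row from each side.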

In [None]:
left_join.show()
right_join.show()
outer_join.show()

## The Inner, Semi, Cross and Anti Joins

First, you will need DataFrames to work with.

In [None]:
# Row builds the rows of both DataFrames
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=1, c=3, d=1, e=1),
    Row(a=2, b=2, c=1, d=5, e=2),
    Row(a=4, b=5, c=7, d=12, e=3)
])
df2 = spark.createDataFrame([
    Row(f=1, g=2, h=5, i=2, j=2),
    Row(f=5, g=3, h=3, i=6, j=3),
    Row(f=4, g=6, h=8, i=12, j=4)
])

# show both DataFrames
df.show()
df2.show()

# save them as tables df1 and df2 so the SQL queries below can reference them
df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("df1")
df2.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("df2")

### Inner Join
An inner join keeps only the rows whose key appears in both tables. It returns a DataFrame with the columns of both tables for each matching pair of rows.

In [None]:
display(spark.sql("SELECT * FROM df1 INNER JOIN df2 ON df1.a=df2.g"))
display(spark.sql("SELECT * FROM df1 INNER JOIN df2 ON df1.a=df2.f"))
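
As a rough check on what the two queries above should return, the inner join can be modeled in plain Python over the same values. This is a sketch of the semantics, not Spark code, and `inner_join` is a hypothetical helper.

```python
# Plain-Python model of an inner join (illustrative, not Spark code).
df1_rows = [
    {"a": 1, "b": 1, "c": 3, "d": 1, "e": 1},
    {"a": 2, "b": 2, "c": 1, "d": 5, "e": 2},
    {"a": 4, "b": 5, "c": 7, "d": 12, "e": 3},
]
df2_rows = [
    {"f": 1, "g": 2, "h": 5, "i": 2, "j": 2},
    {"f": 5, "g": 3, "h": 3, "i": 6, "j": 3},
    {"f": 4, "g": 6, "h": 8, "i": 12, "j": 4},
]


def inner_join(left, right, left_key, right_key):
    """Return one merged row for every left/right pair whose keys are equal."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[left_key] == r[right_key]
    ]


on_g = inner_join(df1_rows, df2_rows, "a", "g")
on_f = inner_join(df1_rows, df2_rows, "a", "f")
print(len(on_g), len(on_f))  # 1 2
```

Joining on `a = g` matches only `a = 2`, while joining on `a = f` matches `a = 1` and `a = 4`, so the choice of key column changes the result even though the tables are the same.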

### Semi Join
A semi join keeps only the rows of the left table whose key has a match in the right table. It returns a DataFrame with the left table's columns only; no columns from the right table are included. It is the opposite of an anti join.

In [None]:
display(spark.sql("SELECT * FROM df1 LEFT SEMI JOIN df2 ON df1.a=df2.g"))
display(spark.sql("SELECT * FROM df1 LEFT SEMI JOIN df2 ON df1.a=df2.f"))

### Anti Join
An anti join keeps only the rows of the left table whose key has no match in the right table. It returns a DataFrame with the left table's columns only.

In [None]:
display(spark.sql("SELECT * FROM df1 LEFT ANTI JOIN df2 ON df1.a=df2.g"))
display(spark.sql("SELECT * FROM df1 LEFT ANTI JOIN df2 ON df1.a=df2.f"))
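
Semi and anti joins split the left table into two complementary parts. The plain-Python sketch below models that with the `a` column of `df1` and the `g` and `f` columns of `df2`; `left_semi` and `left_anti` are hypothetical helpers, not Spark functions.

```python
# Plain-Python model of left semi and left anti joins (illustrative, not Spark code).
df1_rows = [
    {"a": 1, "b": 1, "c": 3, "d": 1, "e": 1},
    {"a": 2, "b": 2, "c": 1, "d": 5, "e": 2},
    {"a": 4, "b": 5, "c": 7, "d": 12, "e": 3},
]
df2_rows = [
    {"f": 1, "g": 2, "h": 5, "i": 2, "j": 2},
    {"f": 5, "g": 3, "h": 3, "i": 6, "j": 3},
    {"f": 4, "g": 6, "h": 8, "i": 12, "j": 4},
]


def left_semi(left, right, left_key, right_key):
    """Left rows whose key exists on the right; right columns are never added."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in right_keys]


def left_anti(left, right, left_key, right_key):
    """Left rows whose key does not exist on the right."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] not in right_keys]


semi_g = left_semi(df1_rows, df2_rows, "a", "g")
anti_g = left_anti(df1_rows, df2_rows, "a", "g")
print([row["a"] for row in semi_g])  # [2]
print([row["a"] for row in anti_g])  # [1, 4]

semi_f = left_semi(df1_rows, df2_rows, "a", "f")
anti_f = left_anti(df1_rows, df2_rows, "a", "f")
print([row["a"] for row in semi_f])  # [1, 4]
print([row["a"] for row in anti_f])  # [2]
```

For either key, the semi and anti results together contain each left row exactly once.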

### Cross Join
A cross join pairs every row of one table with every row of the other. It returns the Cartesian product of the two tables.

In [None]:
display(spark.sql("SELECT * FROM df1 CROSS JOIN df2"))
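
Because a cross join has no join condition, the row count of the result is the product of the two input row counts. A small plain-Python sketch with `itertools.product`, using only the `a` and `f` values as stand-ins for full rows:

```python
# Plain-Python model of a cross join (illustrative, not Spark code).
from itertools import product

df1_keys = [1, 2, 4]  # column a of df1, standing in for whole rows
df2_keys = [1, 5, 4]  # column f of df2, standing in for whole rows

# Every left value is paired with every right value.
pairs = list(product(df1_keys, df2_keys))
print(len(pairs))  # 9
print(pairs[:3])   # [(1, 1), (1, 5), (1, 4)]
```

Three rows on each side give nine rows, so cross joins grow quickly on larger tables.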

## Concat functions

### Pandas Concat function

In [None]:
data = [
    {'county': 'Clark County', 'state': 'Nevada', 'crime_rate': 0.5},
    {'county': 'Madison County', 'state': 'Idaho', 'crime_rate': 0.2},
    {'county': 'Yuma County', 'state': 'Colorado', 'crime_rate': 0.05}
]
df_1 = pd.DataFrame(data)
df_1

In [None]:
data = [
    {'county': 'Fairfax County', 'state': 'Virginia', 'crime_rate': 0.02},
    {'county': 'Bergen County', 'state': 'New Jersey', 'crime_rate': 0.06},
    {'county': 'Los Alamos County', 'state': 'New Mexico', 'crime_rate': 0.1}
]
df_2 = pd.DataFrame(data)
df_2

In [None]:
combined_df = pd.concat([df_1, df_2], ignore_index=True, axis=0)
combined_df

### PySpark Concat Function Example

In [None]:
df1 = spark.createDataFrame([
    Row(county='Clark County', state='Nevada', crime_rate=0.5),
    Row(county='Madison County', state='Idaho', crime_rate=0.2),
    Row(county='Yuma County', state='Colorado', crime_rate=0.05)
])
df1.show()

In [None]:
df2 = spark.createDataFrame([
    Row(county='Fairfax County', state='Virginia', crime_rate=0.02),
    Row(county='Bergen County', state='New Jersey', crime_rate=0.06),
    Row(county='Los Alamos County', state='New Mexico', crime_rate=0.1)
])
df2.show()

In [None]:
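# Unlike pd.concat, PySpark's concat() joins column values row by row; it does not stack DataFrames
concat_df = df1.withColumn('location', concat(df1.county, lit(', '), df1.state))
# Optionally drop the source columns afterwards:
# concat_df = concat_df.drop('county', 'state')
concat_df.show()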

### Stack Dataframes with Union

In [None]:
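# df3 has one extra column (fips) compared with df1
df3 = spark.createDataFrame([
    Row(county='Jefferson County', state='Idaho', crime_rate=0.08, fips=200)
])
# df4 has the same columns as df1, but in a different order
df4 = spark.createDataFrame([
    Row(crime_rate=0.08, county='Jefferson County', state='Idaho')
])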

### Basic Union

In [None]:
combined = df1.union(df2)
combined.show()

### Swapped Columns 

In [None]:
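# union() matches columns by position, so df4's reordered columns would land under the wrong names;
# unionByName() matches them by column name instead
combined = df1.unionByName(df4)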
combined.show()
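
`union` pairs columns by position, while `unionByName` pairs them by name. The plain-Python sketch below models the difference for a single row with `df4`'s column order; it is illustrative only and does not call Spark.

```python
# Plain-Python model of positional vs. by-name column matching (illustrative, not Spark code).
df1_cols = ["county", "state", "crime_rate"]
df4_cols = ["crime_rate", "county", "state"]
df4_row = (0.08, "Jefferson County", "Idaho")

# By position: values are placed under df1's column names in the order they arrive.
by_position = dict(zip(df1_cols, df4_row))

# By name: each df1 column looks up its value under the same name in df4.
named = dict(zip(df4_cols, df4_row))
by_name = {col: named[col] for col in df1_cols}

print(by_position)
# {'county': 0.08, 'state': 'Jefferson County', 'crime_rate': 'Idaho'}
print(by_name)
# {'county': 'Jefferson County', 'state': 'Idaho', 'crime_rate': 0.08}
```

Positional matching silently puts the crime rate under `county` and the state under `crime_rate`, which is why the column order matters when stacking with `union`.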

### Extra Columns

In [None]:
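# df3 has an extra fips column; allowMissingColumns=True (Spark 3.1+) fills it with null for df1's rows
combined = df1.unionByName(df3, allowMissingColumns=True)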
combined.show()
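
When one DataFrame has a column the other lacks, stacking by name needs a value for the missing cells. A plain-Python sketch of that fill-with-null behaviour, using one row from `df1` and the row from `df3` (a model only, not Spark code):

```python
# Plain-Python model of stacking rows with a missing column (illustrative, not Spark code).
all_cols = ["county", "state", "crime_rate", "fips"]

df1_row = {"county": "Clark County", "state": "Nevada", "crime_rate": 0.5}
df3_row = {"county": "Jefferson County", "state": "Idaho", "crime_rate": 0.08, "fips": 200}

# dict.get returns None for a column the row does not have.
stacked = [{col: row.get(col) for col in all_cols} for row in (df1_row, df3_row)]
for row in stacked:
    print(row)
# {'county': 'Clark County', 'state': 'Nevada', 'crime_rate': 0.5, 'fips': None}
# {'county': 'Jefferson County', 'state': 'Idaho', 'crime_rate': 0.08, 'fips': 200}
```

Every output row has the full set of columns, with `None` standing in where the source row had no `fips` value.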