# PySpark Union & Union All   
        by Aishwarya Raut

PySpark `union()` and `unionAll()` transformations merge two or more DataFrames that share the same schema or structure.

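DataFrame `union()` – the `union()` method merges two DataFrames of the same structure/schema. The output includes all rows from both DataFrames, and duplicates are retained. Columns are matched by position, not by name: the call raises an error if the DataFrames have a different number of columns, but DataFrames with the same columns in a different order are merged without error and the values end up under the wrong column names. To merge DataFrames whose columns are ordered differently, use the `unionByName()` transformation, which matches columns by name (from Spark 3.1 it also accepts `allowMissingColumns=True` for DataFrames that do not share every column).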

`dataFrame1.union(dataFrame2)`

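DataFrame `unionAll()` – `unionAll()` was deprecated in Spark 2.0.0 in favor of `union()`. From Spark 3.0 it is no longer deprecated and is simply an alias for `union()`, so both methods return the same result.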

`dataFrame1.unionAll(dataFrame2)`


In [2]:

# Imports
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SP').getOrCreate()

simpleData = [
    ("James", "Sales", "NY", 90000, 34, 10000),
    ("Michael", "Sales", "NY", 86000, 56, 20000),
    ("Robert", "Sales", "CA", 81000, 30, 23000),
    ("Maria", "Finance", "CA", 90000, 24, 23000),
]

columns = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=simpleData, schema=columns)
df.printSchema()
# df.show(truncate=False)

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)



In [3]:
# Create DataFrame2
simpleData2 = [
    ("James", "Sales", "NY", 90000, 34, 10000),
    ("Maria", "Finance", "CA", 90000, 24, 23000),
    ("Jen", "Finance", "NY", 79000, 53, 15000),
    ("Jeff", "Marketing", "CA", 80000, 25, 18000),
    ("Kumar", "Marketing", "NY", 91000, 50, 21000),
]
columns2 = ["employee_name", "department", "state", "salary", "age", "bonus"]

df2 = spark.createDataFrame(data=simpleData2, schema=columns2)

df2.printSchema()
# df2.show(truncate=False)

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)



# Merge two or more DataFrames using union


In [4]:
# Merge two DataFrames with union(); duplicate rows are kept
union_Df = df.union(df2)
union_Df.printSchema()

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)



# Merge DataFrames using unionAll

The DataFrame `unionAll()` method was deprecated in PySpark 2.0.0, and `union()` was recommended in its place. From Spark 3.0 `unionAll()` is no longer deprecated and is an alias for `union()`.

In [6]:
unionAll_Df = df.unionAll(df2)
# unionAll_Df.show(truncate=False)

# Merge without Duplicates

In [None]:
# Remove duplicates after union() using distinct()
disDF = df.union(df2).distinct()
disDF.show(truncate=False)