# PySpark `unionByName()`
by Aishwarya Raut

The `pyspark.sql.DataFrame.unionByName()` transformation merges (unions) two DataFrames by matching columns on their names rather than on their positions. It also takes an `allowMissingColumns` parameter (available since Spark 3.1, default `False`); set it to `True` when the two DataFrames do not have the same set of columns.



# 1. Syntax of unionByName()

In [None]:
# Syntax (allowMissingColumns defaults to False)
# DataFrame.unionByName(other, allowMissingColumns=False)

# 2. PySpark unionByName() Usage

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SP').getOrCreate()

# Create DataFrame df1 with columns name and id
data = [("James", 34), ("Michael", 56),
        ("Robert", 30), ("Maria", 24)]

df1 = spark.createDataFrame(data=data, schema=["name", "id"])
df1.printSchema()

# Create DataFrame df2 with the same columns in a different order (id, name)
data2 = [(34, "James"), (45, "Maria"),
         (45, "Jen"), (34, "Jeff")]

df2 = spark.createDataFrame(data=data2, schema=["id", "name"])
df2.printSchema()


root
 |-- name: string (nullable = true)
 |-- id: long (nullable = true)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)



In [2]:
# unionByName() matches columns by name; the result keeps df1's column order
df3 = df1.unionByName(df2)
df3.printSchema()

root
 |-- name: string (nullable = true)
 |-- id: long (nullable = true)



# 3. Use unionByName() with Different Columns

In [3]:
# Create DataFrames with different column names
df1 = spark.createDataFrame([[5, 2, 6]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[6, 7, 3]], ["col1", "col2", "col3"])


In [5]:
# allowMissingColumns=True adds the columns missing on either side, filled with nulls
df3 = df1.unionByName(df2, allowMissingColumns=True)
df3.printSchema()

root
 |-- col0: long (nullable = true)
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- col3: long (nullable = true)

