PySpark's `StructType` and `StructField` classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested struct, array, and map columns.

A `StructType` is a collection of `StructField` objects, each of which defines a column name, a column data type, a boolean that specifies whether the field is nullable, and optional metadata.

DataFrame Schemas: `StructType` is commonly used to define the schema when creating a DataFrame, particularly for structured data with fields of different data types.

Nested Structures: you can create complex schemas by nesting `StructType` within other `StructType` objects, allowing you to represent hierarchical or multi-level data.

Enforcing Data Structure: when reading data from various sources, specifying a `StructType` as the schema tells Spark to apply the declared column names and types instead of inferring them, so the data is interpreted and structured as you expect. This is important when dealing with semi-structured or schema-less data sources.


1. StructType - Defines the structure of the DataFrame
2. StructField - Defines the metadata of the DataFrame column
3. Using PySpark StructType and StructField with DataFrame

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

spark= SparkSession.builder.appName('type_and_field').getOrCreate()
data=[("James","","Smith","36636","M",3000),
     ("Michel","Rose","","40288","M",4000),
     ("Robert","","Williams","45114","M",4000),
     ("Maria","Anne","Jones","39192","F",4000),
     ("Jen","Mary","Brown","","F",-1)]

# Each StructField takes (column name, data type, nullable)
schema=StructType([
    StructField("firstname",StringType(),True),
    StructField("middlename",StringType(),True),
    StructField("lastname",StringType(),True),
    StructField("id",StringType(),True),
    StructField("gender",StringType(),True),
    StructField("salary",IntegerType(),True)
])

df=spark.createDataFrame(data=data, schema=schema)
df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



4. Defining a nested StructType

In [8]:
# Defining schema using nested StructType
structureData = [
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Brown"),"","F",-1)
  ]


structureSchema=StructType([
    StructField('name',StructType([
        StructField('firstname',StringType(),True),
        StructField('middlename',StringType(),True),
        StructField('lastname',StringType(),True)
    ])),
    StructField('id',StringType(),True),
    StructField('gender',StringType(),True),
    StructField("salary",IntegerType(),True)
])
df2 = spark.createDataFrame(data=structureData,schema=structureSchema)
df2.printSchema()

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



5. Adding and changing the struct of the DataFrame

In [9]:
# Build a new struct column from existing columns, then drop the originals
from pyspark.sql.functions import col,struct,when
updateddf=df2.withColumn("OtherInfo",
                        struct(col("id").alias("identifier"),
                              col("gender").alias("gender"),
                              col("salary").alias("salary"),
                              # salary is already an integer column, so no cast is needed
                              when(col("salary")<2000,"Low")
                               .when(col("salary")<4000,"Medium")
                               .otherwise("High").alias("Salary_Grade")
                              )).drop("id","gender","salary")

updateddf.printSchema()

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- OtherInfo: struct (nullable = false)
 |    |-- identifier: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- Salary_Grade: string (nullable = false)



6. Using SQL ArrayType and MapType

`StructType` also supports `ArrayType` and `MapType` for defining DataFrame columns that hold array and map collections, respectively. In the example below, the column `hobbies` is defined as `ArrayType(StringType())` and `properties` is defined as `MapType(StringType(), StringType())`, meaning both keys and values are strings.

In [17]:
# using SQL ArrayType & MapType
from pyspark.sql.types import ArrayType,MapType
ArrayStructureSchema = StructType([
    StructField('name',StructType([
        StructField("firstname",StringType(),True),
        StructField("middlename",StringType(),True),
        StructField("lastname",StringType(),True)
    ])),
    StructField('hobbies',ArrayType(StringType()),True),
    StructField("properties",MapType(StringType(),StringType()),True)
])
ArrayStructureSchema

StructType([StructField('name', StructType([StructField('firstname', StringType(), True), StructField('middlename', StringType(), True), StructField('lastname', StringType(), True)]), True), StructField('hobbies', ArrayType(StringType(), True), True), StructField('properties', MapType(StringType(), StringType(), True), True)])