## Overview
This notebook shows how to define a schema for a PySpark DataFrame using several approaches.

#### **Contents :**

- **Spark Schema**
- **StructType – Defines the Structure of the DataFrame**
- **StructField – Defines the Metadata of the DataFrame column**
- **Schema Creation**
    1. Create DataFrame using PySpark StructType & StructField
    2. Create Nested StructType object struct using StructType and StructField
    3. Adding & Changing struct of the DataFrame
    4. Using SQL ArrayType and MapType
    5. Creating StructType object struct from JSON file
    6. Creating StructType object struct from DDL String

#### Possible Interview Questions
1. **How do you create a schema in PySpark? What are the different ways to create one?**
    - A DataFrame schema can be defined in two ways: with `StructType` and `StructField` objects, or with a `DDL` string.

2. **What are StructType and StructField in a Spark schema?**
    - The `StructType` and `StructField` classes in PySpark are used to specify a custom schema for a DataFrame and to create complex columns such as nested struct, array, and map columns.

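3. **What if the data contains a header row?**
    - Set `.option("header", True)` on the reader so that the first row is treated as column names rather than data. Databricks Runtime also documents a `skipRows` CSV option, e.g. `.option("skipRows", 1)`, for skipping leading rows.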
     

This is a **Python** notebook, so the default cell language is Python. Other languages can be used in a cell through the `%LANGUAGE` magic command: `%python`, `%scala`, `%sql`, and `%r` are all supported, and `%fs` runs file system (DBFS) commands.

**Spark Dataframe Documentation Link**
- https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#DataFrame-Creation

#### Spark Schema 

- By default, Spark infers the schema from the data. However, we sometimes need to define our own schema (column names and data types), especially when working with unstructured and semi-structured data.

- **Spark Schema** is the structure of the DataFrame or Dataset. We can define it using the `StructType` class, which is a collection of `StructField` objects.

- The `StructType` and `StructField` classes in PySpark are used to specify a custom schema for a DataFrame and to create complex columns such as nested struct, array, and map columns.

- Why defining a schema matters:
  - **Data Consistency:** An explicit schema fixes the data type of every column, preventing type conflicts during processing.
  - **Data Validation:** When reading files, records that do not match the schema can be nulled, dropped, or rejected depending on the reader `mode` (`PERMISSIVE`, `DROPMALFORMED`, `FAILFAST`), which helps protect data integrity.
  - **Optimized Performance:** With a defined schema, Spark does not need an extra pass over the data to infer column types, which speeds up reads.
  - **Type Safety:** Declaring the types up front surfaces mismatches early instead of as unexpected errors later in the pipeline.

- In Spark, the schema of a DataFrame can be defined in two ways:
  1. **StructType** and **StructField**  
  2. **DDL**

#### StructType – Defines the Structure of the DataFrame

- PySpark provides the `StructType` class in `pyspark.sql.types` to define the structure of the DataFrame.

- `StructType` is commonly used to define the schema when creating a DataFrame, particularly for structured data with fields of different data types.

- `StructType` represents a schema as a collection (list) of `StructField` objects, each with a name and a data type. It also allows nested structures and complex data types.

- Complex schemas are built by nesting a `StructType` inside another `StructType`, which lets us represent hierarchical or multi-level data.

- When reading data from external sources, specifying a `StructType` as the schema ensures that the data is interpreted with the intended column names and types. This matters most for semi-structured or schema-less data sources.

- The DataFrame `printSchema()` method displays `StructType` columns as `struct`.

-----

#### StructField – Defines the Metadata of the DataFrame column

- `StructField` represents a field in the schema, containing metadata such as the name, data type, and nullable status of the field.

- `StructField` defines the column name (`str`), the column type (`DataType`), whether the column is nullable (`bool`), and optional metadata (`dict`).

- Each StructField object defines a single column in the DataFrame, specifying its name and the type of data it holds.

- The `StructField` class is also part of `pyspark.sql.types`.

##### 1. Create DataFrame using PySpark StructType & StructField 

In [0]:
# import statement
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType

In [0]:
# preparing data for creating the DataFrame
data = [("James","","Smith","36636","M",3000),
        ("Michael","Rose","","40288","M",4000),
        ("Robert","","Williams","42114","M",4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","","F",-1)]

# creating schema with StructType and StructField
schema = StructType([StructField("firstname", StringType(), nullable=True),
                     StructField("middlename", StringType(), nullable=True),
                     StructField("lastname", StringType(), nullable=True),
                     StructField("id", StringType(), nullable=True),
                     StructField("gender", StringType(), nullable=True),
                     StructField("salary", IntegerType(), nullable=True)])
 
# creating dataframe 
df = spark.createDataFrame(data=data, schema=schema)

# display the result 
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



##### 2. Create Nested StructType object struct using StructType and StructField
To define a nested StructType in PySpark, use inner StructTypes within StructFields. Each nested StructType is a collection of StructFields, forming a hierarchical structure for representing complex data within DataFrames.

In [0]:

# defining nested structure data
structureData = [(("James","","Smith"),"36636","M",3100),
                 (("Michael","Rose",""),"40288","M",4300),
                 (("Robert","","Williams"),"42114","M",1400),
                 (("Maria","Anne","Jones"),"39192","F",5500),
                 (("Jen","Mary","Brown"),"","F",-1)]

# defining schema using nested StructType
structureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('id', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', IntegerType(), True)
         ])

# creating dataframe
df2 = spark.createDataFrame(data=structureData, schema=structureSchema)

# display result
df2.printSchema()
df2.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|name                |id   |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3100  |
|{Michael, Rose, }   |40288|M     |4300  |
|{Robert, , Williams}|42114|M     |1400  |
|{Maria, Anne, Jones}|39192|F     |5500  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+



##### 3. Adding & Changing struct of the DataFrame
Using the PySpark SQL function `struct()`, we can change the structure of an existing DataFrame and add a new `StructType` column to it.

The example below demonstrates how to copy columns from one structure to another and add a new field.
The [PySpark Column Class](https://sparkbyexamples.com/pyspark/pyspark-column-functions/) also provides some functions for working with `StructType` columns.

The example copies `id` (renamed to `identifier`), `gender`, and `salary` into a new struct column `OtherInfo`, adds a new field `Salary_Grade`, and drops the original top-level columns.

In [0]:
# Updating existing structtype using struct
from pyspark.sql.functions import col,struct,when

updatedDF = df2.withColumn("OtherInfo", struct(col("id").alias("identifier"),
                                               col("gender").alias("gender"),
                                               col("salary").alias("salary"),
                                               when(col("salary").cast(IntegerType()) < 2000,"Low")
                                               .when(col("salary").cast(IntegerType()) < 4000,"Medium")
                                               .otherwise("High").alias("Salary_Grade"))
                           ).drop("id","gender","salary")

updatedDF.printSchema()
updatedDF.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- OtherInfo: struct (nullable = false)
 |    |-- identifier: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- Salary_Grade: string (nullable = false)

+--------------------+------------------------+
|name                |OtherInfo               |
+--------------------+------------------------+
|{James, , Smith}    |{36636, M, 3100, Medium}|
|{Michael, Rose, }   |{40288, M, 4300, High}  |
|{Robert, , Williams}|{42114, M, 1400, Low}   |
|{Maria, Anne, Jones}|{39192, F, 5500, High}  |
|{Jen, Mary, Brown}  |{, F, -1, Low}          |
+--------------------+------------------------+



##### 4. Using SQL ArrayType and MapType
`StructType` also supports `ArrayType` and `MapType` for defining array and map columns, respectively. In the example below, the column `hobbies` is defined as `ArrayType(StringType())` and `properties` as `MapType(StringType(), StringType())`, meaning both keys and values are strings.

In [0]:
# Using SQL ArrayType and MapType
arrayStructureSchema = StructType([
    StructField('name', StructType([
       StructField('firstname', StringType(), True),
       StructField('middlename', StringType(), True),
       StructField('lastname', StringType(), True)
       ])),
       StructField('hobbies', ArrayType(StringType()), True),
       StructField('properties', MapType(StringType(),StringType()), True)
    ])

df3 = spark.createDataFrame([], schema=arrayStructureSchema)
df3.printSchema()
df3.show()

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+----+-------+----------+
|name|hobbies|properties|
+----+-------+----------+
+----+-------+----------+



##### 5. Creating StructType object struct from JSON file
Alternatively, a `StructType` schema can be loaded from a JSON file. To keep it simple, we take the current DataFrame schema with `df2.schema.json()`, store it in a file, and then create a schema from that JSON file.

`df.schema.simpleString()` is also available; it returns a more compact, single-line representation of the schema.

In [0]:
%fs
ls FileStore/tables/

path,name,size,modificationTime
dbfs:/FileStore/tables/2010_summary.csv,2010_summary.csv,7121,1728547018000
dbfs:/FileStore/tables/2010_summary_write/,2010_summary_write/,0,0
dbfs:/FileStore/tables/2010_summary_write.csv/,2010_summary_write.csv/,0,0
dbfs:/FileStore/tables/2010_summary_write_02/,2010_summary_write_02/,0,0
dbfs:/FileStore/tables/2011_summary.json,2011_summary.json,21301,1729515377000
dbfs:/FileStore/tables/2011_summary_write.json/,2011_summary_write.json/,0,0
dbfs:/FileStore/tables/NewFile/,NewFile/,0,0
dbfs:/FileStore/tables/RangeFile/,RangeFile/,0,0
dbfs:/FileStore/tables/RangeText/,RangeText/,0,0
dbfs:/FileStore/tables/RangeText.txt/,RangeText.txt/,0,0


In [0]:
# Using json() to serialize the StructType
schema_json = df2.schema.json()   # json() already returns a JSON string, so json.dumps() is not needed
print(schema_json)

# dbutils.fs works with DBFS paths, so the path carries no /dbfs prefix
dbutils.fs.put('/FileStore/tables/schema.json', schema_json, overwrite=True)

{"fields":[{"metadata":{},"name":"name","nullable":true,"type":{"fields":[{"metadata":{},"name":"firstname","nullable":true,"type":"string"},{"metadata":{},"name":"middlename","nullable":true,"type":"string"},{"metadata":{},"name":"lastname","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"gender","nullable":true,"type":"string"},{"metadata":{},"name":"salary","nullable":true,"type":"integer"}],"type":"struct"}
Wrote 500 bytes.
Out[41]: True

In [0]:

# Loading the JSON schema to create a DataFrame
import json

# json.loads() parses JSON text, not a file path, so read the file contents first
schema_text = dbutils.fs.head('/FileStore/tables/schema.json')
schemaFromJson = StructType.fromJson(json.loads(schema_text))

df4 = spark.createDataFrame(data=structureData, schema=schemaFromJson)
df4.printSchema()

##### 6. Creating StructType object struct from DDL String
To create a `StructType` object from a `Data Definition Language (DDL)` string, use `StructType.fromDDL()`. This method parses the DDL string and returns a `StructType` that reflects the schema defined in the string. In PySpark it is available from Spark 4.0 onward; the cell below uses the internal helper `_parse_datatype_string` instead, which works on earlier versions.

For example, `struct = StructType.fromDDL("name STRING, age INT")` creates a `StructType` with two fields: `name` of type `STRING` and `age` of type `INT`. This makes it possible to build schemas dynamically from DDL text, which is convenient when an external system or data source already describes its schema in DDL.

In [0]:
import pyspark.sql.types as T

# DDL string with a nested struct column
ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"

# On PySpark 4.0+ the public API can parse it directly:
# ddlSchema = StructType.fromDDL(ddlSchemaStr)
# print(ddlSchema.simpleString())

# Create StructType from a DDL string with the internal helper (needs an active SparkSession)
ddl_schema_string = "col1 string, col2 integer, col3 timestamp"
ddl_schema = T._parse_datatype_string(ddl_schema_string)
ddl_schema

Out[46]: StructType([StructField('col1', StringType(), True), StructField('col2', IntegerType(), True), StructField('col3', TimestampType(), True)])