**Reading: User-Defined Schema (UDS) for DSL and SQL**

How to Define and Enforce a User-Defined Schema in PySpark?

In this reading, you will learn how to define and enforce a user-defined schema in PySpark.

Spark provides a structured data processing framework that can define and enforce schemas for various data sources, including CSV files. Let's look at the steps to define and use a user-defined schema for a CSV file in PySpark:

**Step 1: Import the required libraries.**

In [None]:
from pyspark.sql.types import StructType, IntegerType, FloatType, StringType, StructField

**Step 2: Define the schema.**

Understanding the data before defining a schema is an important step.

Let's take a look at the step-by-step approach to understanding the data and defining an appropriate schema for a given input file:

Explore the data: Inspect a sample of the file to see what kind of values each column holds.

Choose column data types: Pick the data type for each column based on the values you observed.

Define the schema: Use the 'StructType' class in Spark and create one 'StructField' per column, specifying the column name, the data type, and whether null values are allowed.
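The first two steps can be tried out without Spark. The snippet below is a minimal sketch, using only the Python standard library, that reads a few sample rows and reports for each column whether every non-empty value parses as an integer and whether any value is empty. The helper name `guess_column_types` and the inlined sample rows are assumptions made for this illustration, not part of any Spark API.

```python
import csv
import io

# Sample rows, inlined so the snippet runs on its own (illustrative data).
SAMPLE = """emp_id,emp_name,dept,salary,phone
A101,jhon,computer science,1000,+1 (701) 846 958
A102,Peter,Electronics,2000,
A103,Micheal,IT,2500,
"""


def _is_int(value):
    """Return True if the string can be parsed as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def guess_column_types(text):
    """Hypothetical helper: map each column to (type_guess, has_empty_values)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        non_empty = [v for v in values if v != ""]
        # A column is guessed as integer only if all non-empty values parse.
        is_int = bool(non_empty) and all(_is_int(v) for v in non_empty)
        has_empty = len(non_empty) < len(values)
        summary[column] = ("integer" if is_int else "string", has_empty)
    return summary


summary = guess_column_types(SAMPLE)
for column, (type_guess, has_empty) in summary.items():
    print(column, type_guess, "has empty values" if has_empty else "no empty values")
```

A guess like this is only a starting point: a column that looks numeric in a small sample may still contain text further down the file, so the final choice of types should come from knowledge of the data.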

Example:

In [None]:
schema = StructType([
    StructField("Emp_Id", StringType(), False),
    StructField("Emp_Name", StringType(), False),
    StructField("Department", StringType(), False),
    StructField("Salary", IntegerType(), False),
    # Phone numbers contain '+', spaces, and parentheses, so store them as strings
    StructField("Phone", StringType(), True),
])

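The third argument of 'StructField' is the nullable flag: 'False' declares that null values are NOT allowed for the column, and 'True' declares that they are allowed.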

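The schema defined above can be applied to the following CSV data. The header names in the file differ from the field names in the schema; when a user-defined schema is supplied, the columns are matched by position and the DataFrame uses the names from the schema.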

Filename: employee.csv

In [None]:
emp_id,emp_name,dept,salary,phone
A101,jhon,computer science,1000,+1 (701) 846 958
A102,Peter,Electronics,2000,
A103,Micheal,IT,2500,

**Step 3: Read the input file with user-defined schema.**

In [None]:
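from pyspark.sql import SparkSession

# Reuse the active SparkSession if one exists (as in most notebooks), otherwise create one
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame from the CSV file using the user-defined schema
df = (spark.read
  .format("csv")
  .schema(schema)
  .option("header", "true")  # skip the first line instead of reading it as data
  .load("employee.csv")
)
# Display the DataFrame content
df.show()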

**Step 4: Use the printSchema() method in Spark to display the schema of a DataFrame and ensure that the schema is applied correctly to the data.**

In [None]:
df.printSchema()