## `GLAB 345.2.1: PYSQL - StructType & StructField`

#### In this lab, you will work with structured data in PySpark. You will define a schema using StructType and StructField to specify the column names and data types of your data. Then, you will create a PySpark DataFrame with this schema and populate it with sample student data containing several data types (string, integer, and float). Finally, you will display the schema and its fields, and print the schema in a tree format for better readability.


In [5]:
import pyspark
from pyspark.sql import SparkSession
# And import struct types and data types
from pyspark.sql.types import StructType,StructField,StringType,IntegerType,FloatType
spark_app = SparkSession.builder.appName("sparkdemo").getOrCreate()

# Create student data with 3 rows and 6 attributes
students = [['001', 'john', 23, 5.79, 67, 'NY'],
            ['002', 'James', 18, 3.79, 34, 'NY'],
            ['003', 'Eric', 17, 2.79, 17, 'NJ']]

# Two more sample rows, kept for reference; they are not part of students
# [['004', 'Shahparan', 19, 3.69, 28, 'NJ'], ['005', 'Flex', 37, 5.59, 54, 'Dallas']]

# Define the StructType and StructFields 
# For the below column names 
schema=StructType([
    StructField("rollno",StringType(),True),
    StructField("name",StringType(),True), 
    StructField("age",IntegerType(),True),
    StructField("height", FloatType(), True),
    StructField("weight", IntegerType(), True),
    StructField("address", StringType(), True) ])

# Create the dataframe and add schema to the dataframe 
df = spark_app.createDataFrame(students, schema=schema) 

# Show data 
df.show()


+------+-----+---+------+------+-------+
|rollno| name|age|height|weight|address|
+------+-----+---+------+------+-------+
|   001| john| 23|  5.79|    67|     NY|
|   002|James| 18|  3.79|    34|     NY|
|   003| Eric| 17|  2.79|    17|     NJ|
+------+-----+---+------+------+-------+



In [None]:
# df.schema returns the DataFrame's schema as a StructType object.
# Since I've created a DataFrame called df, I can access its schema like this.
# I don't need parentheses because .schema is a property, not a method.
df.schema

df.schema.fields  # Returns the list of StructField objects

# printSchema is a method, so it needs parentheses to run
df.printSchema()  # Prints the schema in a tree format

root
 |-- rollno: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: float (nullable = true)
 |-- weight: integer (nullable = true)
 |-- address: string (nullable = true)

In [18]:
# Create the dataframe and add schema to the dataframe 
df = spark_app.createDataFrame(students, schema=schema)
# Display the schema 
print(df.schema) 

StructType([StructField('rollno', StringType(), True), StructField('name', StringType(), True), StructField('age', IntegerType(), True), StructField('height', FloatType(), True), StructField('weight', IntegerType(), True), StructField('address', StringType(), True)])


In [19]:
# Display the schema fields 
print(df.schema.fields)

[StructField('rollno', StringType(), True), StructField('name', StringType(), True), StructField('age', IntegerType(), True), StructField('height', FloatType(), True), StructField('weight', IntegerType(), True), StructField('address', StringType(), True)]


In [20]:
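# Display the schema in a tree format
# printSchema is a method, so it must be called with parentheses
df.printSchema()

root
 |-- rollno: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: float (nullable = true)
 |-- weight: integer (nullable = true)
 |-- address: string (nullable = true)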