# Schema Specification and RDD Interoperability - Practice Notebook

This notebook covers **Interoperating with RDDs** and **Programmatically Specifying Schema** from the [Spark SQL Getting Started Guide](https://spark.apache.org/docs/latest/sql-getting-started.html).

## Learning Objectives
- Understand RDD to DataFrame conversion
- Learn schema inference vs explicit schema definition
- Practice creating DataFrames with custom schemas
- Work with StructType and StructField
- Handle complex data types

## Sections
1. **Basic RDD Operations and DataFrame Conversion**
2. **Programmatically Specifying Schema**
3. **Working with Different Data Types**
4. **Schema Validation and Error Handling**
5. **Practice Exercises**

---


In [1]:
# Setup
from pyspark.sql import SparkSession, Row
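# Import only the types this notebook uses instead of a wildcard import
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, LongType, DoubleType,
    BooleanType, DateType, TimestampType, DecimalType, ArrayType, MapType
)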
from pyspark.sql import functions as F

# Create SparkSession
spark = SparkSession.builder.appName("Schema Specification").getOrCreate()
sc = spark.sparkContext

print("SparkSession and SparkContext created successfully!")
print(f"Spark Version: {spark.version}")


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/13 18:34:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/13 18:34:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


SparkSession and SparkContext created successfully!
Spark Version: 4.0.0


## 1. Basic RDD Operations and DataFrame Conversion

First, let's understand how to work with RDDs and convert them to DataFrames.


In [2]:
# Create RDD from text data (simulating reading from a file)
text_data = [
    "Alice,25,Engineer",
    "Bob,30,Manager",
    "Charlie,35,Engineer",
    "Diana,28,Analyst"
]

In [5]:
# Create RDD
lines_rdd = sc.parallelize(text_data)
print("Original RDD:")
print(lines_rdd.collect())

Original RDD:
['Alice,25,Engineer', 'Bob,30,Manager', 'Charlie,35,Engineer', 'Diana,28,Analyst']


In [6]:
def parse_line(line):
    # Split "name,age,job" and cast the age to int so Spark infers a numeric column
    parts = line.split(',')
    return (parts[0], int(parts[1]), parts[2])

parsed_rdd = lines_rdd.map(parse_line)
print("\nParsed RDD:")
print(parsed_rdd.collect())


Parsed RDD:
[('Alice', 25, 'Engineer'), ('Bob', 30, 'Manager'), ('Charlie', 35, 'Engineer'), ('Diana', 28, 'Analyst')]



In [9]:
df_inferred = spark.createDataFrame(parsed_rdd, ["name","age","job"])
df_inferred.show()

+-------+---+--------+
|   name|age|     job|
+-------+---+--------+
|  Alice| 25|Engineer|
|    Bob| 30| Manager|
|Charlie| 35|Engineer|
|  Diana| 28| Analyst|
+-------+---+--------+



In [10]:
df_inferred.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- job: string (nullable = true)



## 2. Programmatically Specifying Schema

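When the schema cannot be inferred, or when column types and nullability need to be controlled precisely, we can define it programmatically with `StructType` and `StructField`. For example, inference above typed `age` as `long`; an explicit schema can make it `integer`.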


In [14]:
custom_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("job", StringType(), True)
])
print("Custom schema definition:")
print(custom_schema)

Custom schema definition:
StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), True), StructField('job', StringType(), True)])


In [15]:
# Create DataFrame with custom schema
df_custom_schema = spark.createDataFrame(parsed_rdd, custom_schema)
print("\nDataFrame with custom schema:")
df_custom_schema.show()
df_custom_schema.printSchema()


DataFrame with custom schema:
+-------+---+--------+
|   name|age|     job|
+-------+---+--------+
|  Alice| 25|Engineer|
|    Bob| 30| Manager|
|Charlie| 35|Engineer|
|  Diana| 28| Analyst|
+-------+---+--------+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)



In [16]:
# More complex schema example
complex_schema = StructType([
    StructField("employee_id", IntegerType(), False),  # Not nullable
    StructField("personal_info", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("job_details", StructType([
        StructField("title", StringType(), True),
        StructField("salary", DoubleType(), True),
        StructField("start_date", DateType(), True)
    ]), True),
    StructField("skills", ArrayType(StringType()), True)
])

print("\nComplex nested schema:")
print(complex_schema.simpleString())


Complex nested schema:
struct<employee_id:int,personal_info:struct<name:string,age:int,email:string>,job_details:struct<title:string,salary:double,start_date:date>,skills:array<string>>


## 3. Working with Different Data Types

Explore various Spark SQL data types and how to use them in schema definitions.


In [19]:
from datetime import date, datetime
from decimal import Decimal

In [20]:
# Create sample data with different types
sample_data = [
    Row(
        id=1,
        name="Alice",
        salary=75000.50,
        is_active=True,
        hire_date=date(2020, 1, 15),
        last_login=datetime(2024, 1, 10, 14, 30, 0),
        bonus=Decimal("5000.25"),
        skills=["Python", "SQL", "Spark"],
        metadata={"department": "Engineering", "level": "Senior"}
    ),
    Row(
        id=2,
        name="Bob",
        salary=85000.75,
        is_active=False,
        hire_date=date(2019, 3, 20),
        last_login=datetime(2024, 1, 9, 9, 15, 0),
        bonus=Decimal("7500.00"),
        skills=["Java", "Scala", "Kafka"],
        metadata={"department": "Engineering", "level": "Lead"}
    )
]

sample_data

[Row(id=1, name='Alice', salary=75000.5, is_active=True, hire_date=datetime.date(2020, 1, 15), last_login=datetime.datetime(2024, 1, 10, 14, 30), bonus=Decimal('5000.25'), skills=['Python', 'SQL', 'Spark'], metadata={'department': 'Engineering', 'level': 'Senior'}),
 Row(id=2, name='Bob', salary=85000.75, is_active=False, hire_date=datetime.date(2019, 3, 20), last_login=datetime.datetime(2024, 1, 9, 9, 15), bonus=Decimal('7500.00'), skills=['Java', 'Scala', 'Kafka'], metadata={'department': 'Engineering', 'level': 'Lead'})]

In [25]:
# Create DataFrame from Row objects (schema inferred)
df_various_types = spark.createDataFrame(sample_data)
print("DataFrame with various data types:")
df_various_types.show(truncate=False)
df_various_types.printSchema()

DataFrame with various data types:
+---+-----+--------+---------+----------+-------------------+-----------------------+--------------------+--------------------------------------------+
|id |name |salary  |is_active|hire_date |last_login         |bonus                  |skills              |metadata                                    |
+---+-----+--------+---------+----------+-------------------+-----------------------+--------------------+--------------------------------------------+
|1  |Alice|75000.5 |true     |2020-01-15|2024-01-10 14:30:00|5000.250000000000000000|[Python, SQL, Spark]|{department -> Engineering, level -> Senior}|
|2  |Bob  |85000.75|false    |2019-03-20|2024-01-09 09:15:00|7500.000000000000000000|[Java, Scala, Kafka]|{department -> Engineering, level -> Lead}  |
+---+-----+--------+---------+----------+-------------------+-----------------------+--------------------+--------------------------------------------+

root
 |-- id: long (nullable = true)
 |-- name: stri

In [26]:
# Define explicit schema for the same data
explicit_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("is_active", BooleanType(), True),
    StructField("hire_date", DateType(), True),
    StructField("last_login", TimestampType(), True),
    StructField("bonus", DecimalType(10, 2), True),
    StructField("skills", ArrayType(StringType()), True),
    StructField("metadata", MapType(StringType(), StringType()), True)
])

print("\nExplicit schema for various data types:")
print(explicit_schema.simpleString())


Explicit schema for various data types:
struct<id:int,name:string,salary:double,is_active:boolean,hire_date:date,last_login:timestamp,bonus:decimal(10,2),skills:array<string>,metadata:map<string,string>>


## 4. Schema Validation and Error Handling

Learn how to handle schema mismatches and validation errors.


In [27]:
# Schema validation examples
print("=== SCHEMA VALIDATION ===")

=== SCHEMA VALIDATION ===


In [29]:
strict_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("salary", DoubleType(), True)
])

In [32]:
valid_data = [(1, "Alice", 75000.0), (2, "Bob", 85000.0)]
df_valid = spark.createDataFrame(valid_data, strict_schema)
df_valid.show()

+---+-----+-------+
| id| name| salary|
+---+-----+-------+
|  1|Alice|75000.0|
|  2|  Bob|85000.0|
+---+-----+-------+



In [33]:
# Try to create a DataFrame with invalid data. createDataFrame verifies rows
# against the schema by default, so None in a non-nullable field raises an error.
try:
    invalid_data = [(1, "Alice", 75000.0), (2, None, 85000.0)]  # None in non-nullable field
    df_invalid = spark.createDataFrame(invalid_data, strict_schema)
    print("DataFrame created with invalid data:")
    df_invalid.show()  # Only reached if schema verification passes
except Exception as e:
    print(f"Error: {e}")

Error: [FIELD_NOT_NULLABLE_WITH_NAME] field name: This field is not nullable, but got None.


In [35]:
# Schema comparison
print("\n=== SCHEMA COMPARISON ===")
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

schema2 = StructType([
    StructField("id", LongType(), True),  # Different type
    StructField("name", StringType(), True)
])

print("Schema 1:", schema1.simpleString())
print("Schema 2:", schema2.simpleString())
print("Are schemas equal?", schema1 == schema2)


=== SCHEMA COMPARISON ===
Schema 1: struct<id:int,name:string>
Schema 2: struct<id:bigint,name:string>
Are schemas equal? False


In [36]:
# Working with nullable vs non-nullable fields
print("\n=== NULLABLE VS NON-NULLABLE ===")
nullable_schema = StructType([
    StructField("id", IntegerType(), True),    # Nullable
    StructField("name", StringType(), False)   # Non-nullable
])

print("Nullable schema:", nullable_schema.simpleString())
for field in nullable_schema.fields:
    print(f"Field '{field.name}': nullable={field.nullable}, type={field.dataType}")


=== NULLABLE VS NON-NULLABLE ===
Nullable schema: struct<id:int,name:string>
Field 'id': nullable=True, type=IntegerType()
Field 'name': nullable=False, type=StringType()


## 5. Practice Exercises

Complete these exercises to practice schema specification and RDD operations.


In [37]:
exercise_data = [
    "1,John Doe,Software Engineer,75000,2020-01-15,Python;Java;SQL",
    "2,Jane Smith,Data Scientist,85000,2019-06-20,Python;R;Machine Learning",
    "3,Bob Johnson,DevOps Engineer,80000,2021-03-10,Docker;Kubernetes;AWS",
    "4,Alice Brown,Product Manager,90000,2018-09-05,Agile;Scrum;Analytics"
]

# Create RDD from the data
exercise_rdd = sc.parallelize(exercise_data)

print("=== EXERCISE 1: Parse RDD and Create DataFrame ===")
print("Raw data:")
for line in exercise_data:
    print(line)

# TODO: Complete this exercise
print("\nTODO: Parse the RDD and create a DataFrame")
print("1. Create a function to parse each line")
print("2. Split by comma and handle the skills field (split by semicolon)")
print("3. Convert to appropriate data types")
print("4. Create DataFrame with inferred schema")


=== EXERCISE 1: Parse RDD and Create DataFrame ===
Raw data:
1,John Doe,Software Engineer,75000,2020-01-15,Python;Java;SQL
2,Jane Smith,Data Scientist,85000,2019-06-20,Python;R;Machine Learning
3,Bob Johnson,DevOps Engineer,80000,2021-03-10,Docker;Kubernetes;AWS
4,Alice Brown,Product Manager,90000,2018-09-05,Agile;Scrum;Analytics

TODO: Parse the RDD and create a DataFrame
1. Create a function to parse each line
2. Split by comma and handle the skills field (split by semicolon)
3. Convert to appropriate data types
4. Create DataFrame with inferred schema
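
One possible starting point for steps 1 to 3 is sketched below. It is not the only valid solution: the function name `parse_employee` is illustrative, and the code assumes that no field contains a comma and that dates are in ISO format, which holds for the four sample lines.

```python
from datetime import date


def parse_employee(line):
    """Parse 'id,name,title,salary,start_date,skill;skill;...' into a typed tuple."""
    emp_id, name, title, salary, start_date, skills = line.split(",")
    return (
        int(emp_id),
        name,
        title,
        float(salary),
        date.fromisoformat(start_date),  # becomes a date column in Spark
        skills.split(";"),               # becomes an array<string> column
    )


sample_line = "1,John Doe,Software Engineer,75000,2020-01-15,Python;Java;SQL"
parsed_employee = parse_employee(sample_line)
print(parsed_employee)

# With the notebook's `spark` and `exercise_rdd` in scope, step 4 could be:
#   df_exercise = spark.createDataFrame(
#       exercise_rdd.map(parse_employee),
#       ["id", "name", "title", "salary", "start_date", "skills"],
#   )
```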
