# Schema Enforcement

**Schema enforcement**, also known as _**schema validation**_, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a "reservation"), and rejects any writes with columns that aren't on the list.

## Schema Validation
Delta Lake automatically validates that the schema of the dataframe being written is compatible with the schema of the table. Delta Lake uses the following rules to determine whether a write from a dataframe to a table is compatible:

- **All dataframe columns must exist in the target table**. If there are columns in the dataframe not present in the table, an exception is raised. Columns present in the table but not in the dataframe are set to null.
- **Dataframe column data types must match the column data types in the target table**. If they don’t match, an exception is raised.
- **Dataframe column names cannot differ only by case.** This means that you cannot have columns such as “Foo” and “foo” defined in the same table. 
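
The three rules above can be modelled in a few lines of plain Python. This is only an illustrative sketch, not Delta Lake's actual implementation: schemas are represented as `{column_name: type_name}` dictionaries, and the function name `find_schema_violations` is hypothetical.

```python
# Illustrative only: a plain-Python model of the three compatibility rules.
# Schemas are modelled as {column_name: type_name} dicts (an assumption of
# this sketch); this is not how Delta Lake implements the check internally.

def find_schema_violations(df_schema, table_schema):
    """Return a list of human-readable rule violations (empty if compatible)."""
    violations = []

    # Rule 3: dataframe column names must not differ only by case.
    seen = {}
    for name in df_schema:
        key = name.lower()
        if key in seen:
            violations.append(f"columns '{seen[key]}' and '{name}' differ only by case")
        else:
            seen[key] = name

    # Column lookup is case-insensitive, mirroring the case-insensitive schema.
    table_by_lower = {name.lower(): dtype for name, dtype in table_schema.items()}

    for name, dtype in df_schema.items():
        expected = table_by_lower.get(name.lower())
        if expected is None:
            # Rule 1: every dataframe column must exist in the target table.
            violations.append(f"column '{name}' is not in the target table")
        elif expected != dtype:
            # Rule 2: data types must match.
            violations.append(f"column '{name}' has type {dtype}, table expects {expected}")
    return violations


table = {"action": "string", "date": "date", "device_id": "int"}

# A missing "date" column is fine: it would simply be written as null.
print(find_schema_violations({"action": "string", "device_id": "int"}, table))
# An extra column and a type mismatch are both reported.
print(find_schema_violations({"action": "string", "location": "string"}, table))
print(find_schema_violations({"device_id": "string"}, table))
```

Returning a list of violations rather than raising on the first one makes it easier to see every offending column at once.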

While you can use Spark in case-sensitive or case-insensitive (default) mode, Parquet is case-sensitive when storing and returning column information.

Delta Lake is case-preserving but case-insensitive when storing the schema, and it imposes this restriction to avoid potential mistakes, data corruption, or data loss.
Delta Lake supports DDL to add new columns explicitly, as well as the ability to update the schema automatically.

In [None]:
# Generate dummy data

from pyspark.sql.functions import expr, lit, col
from pyspark.sql.types import *
from datetime import date


df = spark.range(5) \
  .selectExpr("if(id % 2 = 0, 'Open', 'Close') as action") \
  .withColumn("date", expr("cast(concat('2023-06-', cast(rand(5) * 30 as int) + 1) as date)")) \
  .withColumn("device_id", expr("cast(rand(5) * 100 as int)"))

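# Make sure the target schema exists, then start from a clean table
spark.sql("CREATE SCHEMA IF NOT EXISTS demo")
spark.sql("DROP TABLE IF EXISTS demo.device")

# Use the fully qualified name so the later cells (demo.device) find the table
delta_table_name = 'demo.device'
df.write.format("delta").mode("overwrite").saveAsTable(delta_table_name)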

Showing the **current table schema**

In [None]:
%%sql 

DESCRIBE TABLE demo.device  

Let's perform a few write operations and see how schema enforcement behaves.
- **Appending** some data that matches the table schema

In [None]:
deviceSchema = StructType([StructField("action", StringType(), False),
  StructField("date", DateType(), False),
  StructField("device_id", IntegerType(), False),
  ])

data = [
        ('In Progress', date.today(), -1)
    ]  

new_device = spark.createDataFrame(data=data,schema=deviceSchema)

# insert a new row into delta table
new_device.write.format("delta").mode("append").saveAsTable("demo.device")

> OR

In [None]:
%%sql 
INSERT INTO demo.device 
SELECT 'In Progress', current_date(), -1

- **Appending** some data that has a new column
- New dataframe contains a new column named "location"

In [None]:
deviceSchema = StructType([StructField("action", StringType(), False),
  StructField("date", DateType(), False),
  StructField("device_id", IntegerType(), False),
  StructField("location", StringType(), False) # new column
  ])

data = [
        ('In Progress', date.today(), -1, "Dummy location")
    ]  

new_device = spark.createDataFrame(data=data,schema=deviceSchema)

Rather than automatically adding the new columns, **Delta Lake enforces the schema** and stops the write from occurring. 

To help identify which column(s) caused the mismatch, Spark **prints out both schemas** in the stack trace for comparison.

In [None]:
# An exception will be thrown: A schema mismatch detected when writing to the Delta table

new_device.write.format("delta") \
                .mode("append") \
                .saveAsTable("demo.device")

## Why is schema enforcement so important?

Because it's such a stringent check, _**schema enforcement is an excellent tool**_ to use as a gatekeeper of a clean, fully transformed data set that is ready for production or consumption. It's typically enforced on tables that directly feed:

- _Machine learning algorithms_
- _BI dashboards_
- _Data analytics and visualization tools_
- _Any production system requiring highly structured, strongly typed, semantic schema_

# Schema Evolution

Schema evolution is a feature that **allows users to easily change** a table's current schema to _accommodate data that is changing over time_. Most commonly, it's used when performing an append or overwrite operation, to _**automatically adapt the schema**_ to include one or more new columns.

You can append a dataframe with a different schema to the Delta table by explicitly setting the **mergeSchema** option to **true**.

The following types of schema changes are eligible for schema evolution during table appends or overwrites:

- _Adding new columns (this is the most common scenario)_
- _Changing data types from NullType -> any other type, or upcasting from ByteType -> ShortType -> IntegerType_


Other changes, which are not eligible for schema evolution, require that the schema and data are overwritten by adding **.option("overwriteSchema", "true")**. Those changes include:

- _Dropping a column_
- _Changing an existing column's data type (in place)_
- _Renaming column names that differ only by case (e.g. “Foo” and “foo”)_
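
As a rough mental model of the eligible and ineligible changes listed above, the merge can be sketched in plain Python. This is an illustration, not Delta Lake's implementation: schemas are `{column_name: type_name}` dictionaries, the short type names and the `merge_schema` / `merge_type` helpers are hypothetical, and the choice to keep the wider of two integral types is an assumption of this sketch.

```python
# Illustrative only: a plain-Python model of the schema evolution rules.
# Type names ("null", "byte", "short", "int", ...) are assumptions of this sketch.

UPCAST_ORDER = ["byte", "short", "int"]  # ByteType -> ShortType -> IntegerType


def merge_type(current, incoming):
    """Return the column type after evolution, or raise if the change is ineligible."""
    if current == incoming:
        return current
    if current == "null":
        # NullType -> any other type is allowed.
        return incoming
    if current in UPCAST_ORDER and incoming in UPCAST_ORDER:
        # Keep the wider of the two integral types (assumption of this sketch).
        return max(current, incoming, key=UPCAST_ORDER.index)
    # Anything else is treated as ineligible here and would need overwriteSchema.
    raise ValueError(f"cannot evolve {current} to {incoming}; overwrite the schema instead")


def merge_schema(table_schema, df_schema):
    """Return the evolved table schema after appending data with df_schema."""
    merged = dict(table_schema)
    for name, dtype in df_schema.items():
        if name in merged:
            merged[name] = merge_type(merged[name], dtype)
        else:
            # New column: added after the existing ones.
            merged[name] = dtype
    return merged


table = {"action": "string", "date": "date", "device_id": "int"}
print(merge_schema(table, {"action": "string", "location": "string"}))
print(merge_schema({"n": "byte"}, {"n": "short"}))

try:
    merge_schema(table, {"device_id": "string"})
except ValueError as err:
    print(f"rejected: {err}")
```

Note that dropping a column never shows up here at all: columns missing from the incoming data are simply left in place, which is why a drop requires overwriting the schema.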


In [None]:
new_device.write.format("delta") \
                .mode("append") \
                .option("mergeSchema", True) \
                .saveAsTable("demo.device")

Showing the schema evolution

In [None]:
%%sql 

DESCRIBE TABLE demo.device

## Enable autoMerge

Setting **_mergeSchema_** to true every time you'd like to write with a mismatched schema can be tedious. Let's look at how to enable schema evolution by default.

You can set a Spark property that enables **autoMerge** by default. Once this property is set, you don't need to manually set **_mergeSchema_** to true when writing data with a different schema to a Delta table.

Set the Spark configuration **spark.databricks.delta.schema.autoMerge.enabled** to **true** to make schema evolution the default for the current Spark session.


> **Warning**
> Use with caution, as schema enforcement **_will no longer warn you about unintended schema mismatches_**.
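
One way to limit the risk described in the warning is to enable the setting only for the duration of a specific write and restore the previous value afterwards. The helper below is a minimal sketch of that idea: `temporary_conf` and `DictConf` are hypothetical names, and `DictConf` is an in-memory stand-in so the example runs without a Spark session. The helper relies only on the `get(key, default)`, `set(key, value)`, and `unset(key)` methods that `spark.conf` provides.

```python
from contextlib import contextmanager

AUTO_MERGE_KEY = "spark.databricks.delta.schema.autoMerge.enabled"


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` on `conf`, then restore the previous state on exit."""
    previous = conf.get(key, None)  # None means the key was not explicitly set
    conf.set(key, value)
    try:
        yield
    finally:
        # Restore even if the write inside the `with` block raised.
        if previous is None:
            conf.unset(key)
        else:
            conf.set(key, previous)


class DictConf:
    """Hypothetical in-memory stand-in exposing the get/set/unset calls used above."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


conf = DictConf()
with temporary_conf(conf, AUTO_MERGE_KEY, "true"):
    print(conf.get(AUTO_MERGE_KEY))  # value while the block is active
print(conf.get(AUTO_MERGE_KEY))      # value after the block has exited

# In the notebook, with a live `spark` session, the same helper would be used as:
# with temporary_conf(spark.conf, AUTO_MERGE_KEY, "true"):
#     new_device.write.format("delta").mode("append").saveAsTable("demo.device")
```

Scoping the setting this way keeps schema enforcement active for every other write in the session.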


In [None]:
spark.conf.get("spark.databricks.delta.schema.autoMerge.enabled")

> You can enable schema evolution by default by setting **autoMerge** to **true**

In [None]:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

Let's create a dataframe with an entirely different schema from the existing Delta table and see what happens when it's appended.

In [None]:
deviceSchema = StructType([StructField("action", StringType(), False),
  StructField("status", StringType(), False) # new column
  ])

data = [
        ('Done', "Good")
    ]  

new_device = spark.createDataFrame(data=data,schema=deviceSchema)

Let's append this two-column dataframe to the Delta table to illustrate.

In [None]:
## .option("mergeSchema", True) is not needed anymore

new_device.write.format("delta") \
                .mode("append") \
                .saveAsTable("demo.device")

In [None]:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "false")

## Why is schema evolution so important?

Schema evolution can be used any time you intend to change the schema of your tables (as opposed to cases where you accidentally added columns to your dataframe that shouldn't be there). It's the easiest way to migrate your schema because it automatically adds the correct column names and data types, without you having to declare them explicitly.

In [None]:
%%sql 

DESCRIBE TABLE demo.device

In [None]:
%%sql 

SELECT * FROM demo.device

## Explicitly update schema 

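To change a column's type or name, rewrite the table with the **overwriteSchema** option set to **true**. The example below casts `device_id` to a string.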

In [None]:
df = spark.read.table("demo.device").withColumn("device_id", col("device_id").cast("string"))

In [None]:
df.printSchema()

In [None]:
df.write.format("delta") \
                .mode("overwrite") \
                .option("overwriteSchema", True) \
                .saveAsTable("demo.device")

In [None]:
%%sql 

DESCRIBE TABLE demo.device

In [None]:
spark.sql("DROP TABLE IF EXISTS demo.device")