
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Schema Enforcement & Evolution

**Objective:** Work with an evolving schema

## Notebook Configuration

Before you run this cell, make sure to add a unique user name to the file
<a href="$./includes/configuration" target="_blank">
includes/configuration</a>, e.g.

```
username = "yourfirstname_yourlastname"
```

In [0]:
%run ./includes/configuration

### Health tracker data sample

```
{"device_id":0,"heartrate":57.6447293596,"name":"Deborah Powell","time":1.5830208E9,"device_type":"version 2"}
{"device_id":0,"heartrate":57.6175546013,"name":"Deborah Powell","time":1.5830244E9,"device_type":"version 2"}
{"device_id":0,"heartrate":57.8486376876,"name":"Deborah Powell","time":1.583028E9,"device_type":"version 2"}
{"device_id":0,"heartrate":57.8821378637,"name":"Deborah Powell","time":1.5830316E9,"device_type":"version 2"}
{"device_id":0,"heartrate":59.0531490807,"name":"Deborah Powell","time":1.5830352E9,"device_type":"version 2"}
```
This shows a sample of the health tracker data we will be using. Note that each line is a valid JSON object.
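
As a quick illustration of the one-object-per-line layout, the following plain-Python sketch (not part of the notebook pipeline) parses two of the sample records with the standard `json` module. The variable names `sample`, `records`, and `columns` are introduced here for illustration only.

```python
import json

# Two records copied from the sample above, one JSON object per line.
sample = """{"device_id":0,"heartrate":57.6447293596,"name":"Deborah Powell","time":1.5830208E9,"device_type":"version 2"}
{"device_id":0,"heartrate":57.6175546013,"name":"Deborah Powell","time":1.5830244E9,"device_type":"version 2"}"""

# Each non-empty line parses independently of the others.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

# The keys of a record become the column names.
columns = sorted(records[0])
print(len(records), columns)
```

Because every line stands on its own, a reader can split the file by line and parse the pieces separately.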

### Health tracker data schema
The data has the following schema:


| Column     | Type      |
|------------|-----------|
| name       | string    |
| heartrate  | double    |
| device_id  | int       |
| time       | long      |
| device_type| string    |

### Step 1: Load the Next Month of Data
We begin by loading the data from the file `health_tracker_data_2020_3.json`, using the `json` reader format (`.format("json")`) as before.

In [0]:
file_path = health_tracker + "raw/health_tracker_data_2020_3.json"


health_tracker_data_2020_3_df = (
  spark.read
  .format("json")
  .load(file_path)
)

### Step 2: Transform the Data

We apply the same transformations to the data as before:
- Use the `from_unixtime` Spark SQL function to transform the unixtime into a time string
- Cast the time column to type `timestamp` to replace the column `time`
- Cast the time column to type `date` to create the column `dte`
- Cast the `device_id` column to type `integer` and rename it to `p_device_id`

In [0]:
from pyspark.sql.functions import col, from_unixtime


def process_health_tracker_data(dataframe):
    # Derive `dte` and `time` from the unixtime column, rename `device_id`
    # to `p_device_id`, and keep the remaining columns unchanged.
    return dataframe.select(
        from_unixtime("time").cast("date").alias("dte"),
        from_unixtime("time").cast("timestamp").alias("time"),
        "heartrate",
        "name",
        col("device_id").cast("integer").alias("p_device_id"),
        "device_type",
    )


processedDF = process_health_tracker_data(health_tracker_data_2020_3_df)
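
To make the two time casts concrete, here is a plain-Python sketch (not Spark) of the values they yield for the first `time` entry in the data sample. It assumes UTC; Spark's `from_unixtime` formats the value in the session time zone, so a cluster configured differently may produce shifted results. The helper name `to_time_and_date` is hypothetical.

```python
from datetime import datetime, timezone


def to_time_and_date(unix_seconds):
    """Return (timestamp, date) for a Unix time in seconds, interpreted in UTC."""
    timestamp = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return timestamp, timestamp.date()


# First `time` value from the data sample.
time_value, dte_value = to_time_and_date(1.5830208e9)
print(time_value.isoformat(), dte_value.isoformat())
```

Under the UTC assumption, the sample value falls at midnight on the first day of March 2020, which matches the file name `health_tracker_data_2020_3.json`.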



### Step 3: Append the Data to the `health_tracker_processed` Delta table
We attempt this using `.mode("append")`, catching any `AnalysisException` the write raises.

In [0]:
from pyspark.sql.utils import AnalysisException

try:
  (
    processedDF.write
    .mode("append")
    .format("delta")
    .save(health_tracker + "processed")
  )
except AnalysisException as error:
  # The write is rejected because the DataFrame schema differs from the table schema
  print("Analysis Exception:")
  print(error)
  

## Schema Mismatch
The command above produces the error: 
```
AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: ...)
```

To enable schema migration using DataFrameWriter or DataStreamWriter, set: `.option("mergeSchema", "true")`.

For other operations, set the session configuration `spark.databricks.delta.schema.autoMerge.enabled` to `"true"`. See [the documentation](https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html) specific to the operation for details.

## What Is Schema Enforcement?
Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema.
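
The idea can be sketched in plain Python. This is a simplified illustration, not Delta Lake's implementation: schemas are modeled as `{column: type}` dictionaries, the function name `check_write` is hypothetical, and the table schema shown is an assumption (a table that does not yet have the `device_type` column).

```python
# Assumed schema of the existing table (illustrative only).
table_schema = {
    "dte": "date",
    "time": "timestamp",
    "heartrate": "double",
    "name": "string",
    "p_device_id": "integer",
}
# Incoming data carries one extra column.
incoming_schema = dict(table_schema, device_type="string")


def check_write(table_schema, incoming_schema):
    """Raise ValueError when the incoming schema does not fit the table schema."""
    extra = [c for c in incoming_schema if c not in table_schema]
    changed = [
        c for c in incoming_schema
        if c in table_schema and incoming_schema[c] != table_schema[c]
    ]
    if extra or changed:
        raise ValueError(
            f"schema mismatch: extra columns {extra}, changed types {changed}"
        )


try:
    check_write(table_schema, incoming_schema)
    outcome = "accepted"
except ValueError as error:
    outcome = str(error)

print(outcome)
```

In this sketch the write is rejected because of the extra column, while a write whose schema matches the table passes the check.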

## What Is Schema Evolution?

Schema evolution is a feature that allows users to easily change a table’s current schema to accommodate data that is changing over time. Most commonly, it’s used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns.
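
A minimal plain-Python sketch of the merge idea follows. It is a conceptual illustration under stated assumptions, not Delta Lake's actual logic: schemas are `{column: type}` dictionaries, the function name `merge_schema` is hypothetical, the table schema is assumed, and any type change is rejected outright, which is stricter than what Delta Lake may accept.

```python
def merge_schema(table_schema, incoming_schema):
    """Return the table schema extended with any new incoming columns."""
    merged = dict(table_schema)
    for column, dtype in incoming_schema.items():
        if column not in merged:
            merged[column] = dtype  # new column is appended at the end
        elif merged[column] != dtype:
            raise ValueError(f"cannot merge {column}: {merged[column]} vs {dtype}")
    return merged


# Assumed schema of the existing table (illustrative only).
table_schema = {
    "dte": "date",
    "time": "timestamp",
    "heartrate": "double",
    "name": "string",
    "p_device_id": "integer",
}
incoming_schema = dict(table_schema, device_type="string")

merged = merge_schema(table_schema, incoming_schema)

# A row written before the change has no value for the new column.
old_row = {"name": "Deborah Powell", "heartrate": 57.6}
padded_row = {column: old_row.get(column) for column in merged}

print(list(merged))
```

The existing columns keep their order and types, the new column is added at the end, and rows written earlier simply have no value for it.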

### Step 1: Append the Data with Schema Evolution to the `health_tracker_processed` Delta table
We do this using `.mode("append")` together with `.option("mergeSchema", True)`.

In [0]:

(processedDF.write
 .mode("append")
 .option("mergeSchema", True)
 .format("delta")
 .save(health_tracker + "processed"))


## Verify the Commit
### Step 1: Count the Most Recent Version

In [0]:
spark.read.table("health_tracker_processed").count()


&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>